Avoiding Nested Queries - sql

How important is it to avoid nested queries?
I have always learnt to avoid them like the plague, but they are the most natural thing to me. When I am designing a query, the first thing I write is a nested query. Then I convert it to joins, which sometimes takes a lot of time to get right and rarely gives a big performance improvement (though sometimes it does).
So are they really so bad? Is there a way to use nested queries without temp tables and a filesort?

It really depends; I have had situations where I improved some queries by using subqueries.
The factors that I am aware of are:
whether the subquery uses fields from the outer query for comparison (i.e. whether it is correlated or not)
whether the relation between the outer query and the subquery is covered by indexes
if there are no usable indexes on the joins, and the subquery is not correlated and returns a small result, it might be faster to use it
I have also run into situations where transforming a query that uses ORDER BY into one that does not, and then wrapping it in a simple outer query that does the sort, improved performance in MySQL.
Anyway, it is always good to test different variants (with SQL_NO_CACHE please), and turning correlated queries into joins is a good practice.
I would even go so far as to call it a very useful practice.
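To make the "correlated query into join" point concrete, here is a minimal sketch using hypothetical customers and orders tables (the names and columns are assumptions for illustration, not from the original posts):

```sql
-- Hypothetical schema: customers(id, name), orders(id, customer_id, total)

-- Correlated form: the subquery references the outer row
SELECT c.id, c.name
FROM customers c
WHERE EXISTS (SELECT 1
              FROM orders o
              WHERE o.customer_id = c.id
                AND o.total > 100);

-- Equivalent join form; DISTINCT compensates for customers with many orders
SELECT DISTINCT c.id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 100;
```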
If correlated queries are the first thing that comes to your mind, it might be that you are not primarily thinking in terms of set operations but in terms of procedural operations, and when dealing with relational databases it is very useful to fully adopt the set perspective on the data model and the transformations on it.
EDIT:
Procedural vs Relational
Thinking in terms of set operations versus procedural operations boils down to equivalences between set algebra expressions; for example, a selection on a union is equivalent to a union of selections. As expressions, there is no difference between the two.
But when you compare them as procedures, such as applying the selection criteria to every element of a union versus making the union and then applying the selection, the two are distinctly different procedures, which might have very different properties (for example in CPU, I/O, and memory utilization).
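As a small illustration of that equivalence, assuming two hypothetical tables orders_2022 and orders_2023 with identical columns, both of the following return the same rows even though they describe different procedures:

```sql
-- Form 1: build the union first, then apply the selection
SELECT *
FROM (SELECT * FROM orders_2022
      UNION ALL
      SELECT * FROM orders_2023) AS all_orders
WHERE status = 'shipped';

-- Form 2: apply the selection to each part, then union the results
SELECT * FROM orders_2022 WHERE status = 'shipped'
UNION ALL
SELECT * FROM orders_2023 WHERE status = 'shipped';
```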
The idea behind relational databases is that you do not try to describe how to get the result (procedure), but only what you want, and that the database management system will decide on the best path (procedure) to fulfil your request. This is why SQL is called 4th generation language (4GL).
One of the tricks that help you do that is to remind yourself that tuples have no inherent order (set elements are unordered).
Another is realizing that relational algebra is quite comprehensive and allows translation of requests (requirements) directly into SQL (if the semantics of your model represent the problem space well; in other words, if the meaning attached to the names of your tables and relationships is right; in other words, if your database is designed well).
Therefore, you do not have to think how, only what.
In your case, it was just a preference for correlated queries, so it might be that I am not telling you anything new, but you emphasized that point, hence the comment.
I think that if you were completely comfortable with all the rules that transform queries from one form into another (rules such as distributivity), you would not prefer correlated subqueries; you would see all forms as equal.
(Note: the above discusses the theoretical background, which is important for database design; in practice the concepts deviate from this - not all equivalent rewrites of a query are necessarily executed equally fast, clustered primary keys do give tables an inherent order on disk, etc. But these are only deviations; the fact that not all equivalent queries execute equally fast is an imperfection of the actual DBMS, not of the concepts behind it.)

Personally I prefer to avoid nested queries until they are necessary for the simple reason that nested queries can make the code less human readable and make debugging and collaboration more painful. I think nesting is acceptable if the nested query is something trivial or if temporary storage of large tables becomes an issue. But too many times I've seen complex nested queries within nested queries and it makes debugging painful.

I'm not sure what it looks like in MySQL 5.1 or 5.5, but in 5.0.x nested queries usually have horrible performance, because MySQL performs the subquery for each row fetched by the main query.
This probably isn't the case for more mature databases like MS SQL Server, which can internally rewrite nested queries to joins, but I've never used MS SQL Server so I don't know for sure.
http://dev.mysql.com/doc/refman/5.0/en/rewriting-subqueries.html
"It is also true that on some occasions, it is not only possible to rewrite a query without a subquery, but it can be more efficient to make use of some of these techniques rather than to use subqueries." - which is a rather funny statement, taking into account that for me, so far, all subqueries have made the database crawl.
Subqueries vs joins

I try to avoid nested queries because they're less readable.
But I do agree that they are easier to write; it's just easier to conceptualize when writing the code, IMO. But then, deep nesting makes reading the code very difficult. Make sure you add a comment to tell the reader what the subquery is doing, so they don't need to read it unless they have to.
Also, as soon as it starts getting difficult to read, you might want to consider converting the subquery into a Common Table Expression. The conversion is easy to do, and also makes it much easier to read, since each CTE has a specific purpose.
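As a rough sketch of that conversion (the employees table and columns here are made up for illustration):

```sql
-- Nested form: the intent of the derived table is easy to miss
SELECT e.name, t.avg_salary
FROM employees e
JOIN (SELECT department_id, AVG(salary) AS avg_salary
      FROM employees
      GROUP BY department_id) AS t
  ON t.department_id = e.department_id;

-- Same query with a CTE: the subquery gets a name and a single purpose
WITH department_averages AS (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT e.name, d.avg_salary
FROM employees e
JOIN department_averages d ON d.department_id = e.department_id;
```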

Related

BigQuery - how to think about query optimisation when coming from a SQL Server background

I have a background that includes SQL Server and Informix database query optimisation (non big-data). I'm confident in how to maximise database performance on those systems. I've recently been working with BigQuery and big data (about 9+ months), and optimisation doesn't seem to work the same way. I've done some research and read some articles on optimisation, but I still need to better understand the basics of how to optimise on BigQuery.
In SQL Server/Informix, a lot of the time I would introduce a column index to speed up reads. BigQuery doesn't have indexes, so I've mainly been using clustering. When I've done benchmarking after introducing a cluster on a column that I thought should make a difference, I didn't see any significant change. I'm also not seeing a difference when I switch on query caching. This could be an unfortunate coincidence with the queries I've tried, or a mistaken perception; however, with SQL Server/SQLite/Informix I'm used to seeing immediate, significant improvement, consistently. Am I misunderstanding clustering (I know it's not exactly like an index, but I'm expecting it to work in a similar kind of way), or could it just be that I've somehow been 'unlucky' with the optimisations?
And this is where the real point is. There's almost no such thing as being 'unlucky' with optimisation, but in a traditional RDBMS I would look at the execution plan and know exactly what I need to do to optimise, and find out exactly what's going on. With BigQuery, I can get the 'execution details', but it really isn't telling me much (at least that I can understand) about how to optimise, or how the query really breaks down.
Do I need a significantly different way of thinking about BigQuery? Or does it work in similar ways to an RDBMS, where I can consciously make the first JOINS eliminate as many records as possible, use 'where' clauses that focus on indexed columns, etc. etc.
I feel I haven't got the control to optimise like in an RDBMS, but I'm sure I'm missing a major point (or a few points!). What are the major strategies I should be looking at for BigQuery optimisation, and how can I understand exactly what's going on with queries? If anyone has any links to good documentation that would be fantastic - I've yet to read something that makes me think "Aha, now I get it!".
It is absolutely a paradigm shift in how you think. You're right: you have hardly any control over execution. And you'll eventually come to appreciate that. You do have control over architecture, and that's where a lot of your wins will be. (As others mentioned in comments, the documentation is definitely helpful too.)
I've personally found that premature optimization is one of the biggest issues in BigQuery—often the things you do trying to make a query faster actually have a negative impact, because things like table scans are well optimized and there are internals that you can impact (like restructuring a query in a way that seems more optimal, but forces additional shuffles to disk for parallelization).
Some of the biggest areas where our team HAS seen greatly improved performance are as follows:
Use semi-normalized (nested/repeated) schema when possible. By using nested STRUCT/ARRAY types in your schema, you ensure that the data is colocated with the parent record. You can basically think of these as tables within tables. The use of CROSS JOIN UNNEST() takes a little getting used to, but eliminating those joins makes a big difference (especially on large results).
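A minimal sketch of what that looks like, assuming a hypothetical orders table with an items ARRAY<STRUCT<...>> column (the project, dataset, and column names are placeholders):

```sql
-- Hypothetical table: orders(order_id, customer_id,
--   items ARRAY<STRUCT<sku STRING, quantity INT64>>)
-- The line items live inside the order row, so no join to a separate
-- line-items table is needed.
SELECT
  o.order_id,
  item.sku,
  item.quantity
FROM `my_project.my_dataset.orders` AS o
CROSS JOIN UNNEST(o.items) AS item
WHERE item.quantity > 1;
```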
Use partitioning/clustering on large datasets when possible. I know you mention this; just make sure that you're pruning what you can using _PARTITIONTIME when possible, and also using clustering keys that make sense for your data. Keep in mind that clustering basically sorts the storage order of the data, meaning that the optimizer knows it doesn't have to continue scanning once the criteria have been satisfied (so it doesn't help as much on low-cardinality values).
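For example, on a hypothetical ingestion-time partitioned table clustered by customer_id (names are placeholders), a query along these lines lets BigQuery prune whole partitions before scanning and then use the clustering key within each partition:

```sql
SELECT customer_id, SUM(amount) AS total
FROM `my_project.my_dataset.events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2021-01-01') AND TIMESTAMP('2021-01-31')
  AND customer_id = 'C-1001'   -- clustering column, used to limit scanning
GROUP BY customer_id;
```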
Use analytic window functions when possible. They're very well optimized, and you'll find that BigQuery's implementation is very mature. Often you can eliminate grouping this way, or filter out more of your data earlier in the process. Keep in mind that sometimes filtering data in derived tables or Common Table Expressions (CTEs/named WITH queries) earlier in the process can make a more deeply nested query perform better than trying to do everything in one flat layer.
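A typical pattern along those lines, keeping only the latest row per key without a GROUP BY plus self-join (the table and columns are made up):

```sql
SELECT user_id, event_type, event_time
FROM (
  SELECT
    user_id,
    event_type,
    event_time,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
  FROM `my_project.my_dataset.events`
) AS ranked
-- Window results can't be filtered in the same SELECT's WHERE clause,
-- hence the wrapping query.
WHERE rn = 1;
```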
Keep in mind that results for Views and Common Table Expressions (CTEs/named WITH queries) aren't materialized during execution. If you use a CTE multiple times, it will be executed multiple times. If you join the same View multiple times, it will be executed multiple times. This was hard for members of our team who came from the world of materialized views (although it looks like something is in the works for that in the BQ world, since there's an unused materializedView property showing in the API).
Know how the query cache works. Unlike some platforms, the cache only stores the output of the outermost query, not its component parts. Because of this, only an identical query against unmodified tables/views will use the cache—and it will typically only persist for 24 hours. Note that if you use non-deterministic functions like NOW() and a host of other things, the results are non-cacheable. See details under the Limitations and Exceptions sections of the docs.
Materialize your own copies of expensive tables. We do this a lot, and use scheduled queries and scripts (API and CLI) to normalize and save a native table copy of our data. This allows very efficient processing and fast responses from our client dashboards as well as our own reporting queries. It's a pain, but it works well.
Hopefully that will give you some ideas, but also feel free to post queries on SO in the future that you're having a hard time optimizing. Folks around here are pretty helpful when you let them know what your data looks like and what you've already tried.
Good luck!

Query complexity vs code complexity

So my question is rather simple: what's better and faster, doing calculations in code (let's say Java) or doing complex database queries (assuming we can perform the same action both ways)? Which approach is better in general, and why?
I'd do it in the code.
Doing business calculations in queries in the DB spreads the logic of the app around and makes it less easily understandable; plus you often get bound to a specific storage product (i.e. SQL Server/Oracle/MySQL/etc.), losing the possibility of changing the storage paradigm (e.g. to a NoSQL DB).
In the code you can also apply some dependency injection to easily change the behaviour of your code, making it more manageable.
I generally find it faster (in development time) to write a query to do what I need. The first round focuses on logical correctness, and I'll complete the rest of the functionality using that query. I try to avoid doing queries in loops in this phase, but otherwise I don't worry too much about performance.
Then, I'll look at the performance of the whole feature. If the query is too slow, I'll EXPLAIN it and look at the server's statistics. My first focus is on indexes, if that doesn't work I'll try restructuring the query. Sometimes correlated subqueries are faster than joins, sometimes unions are faster than disjunctions in the WHERE clause, sometimes it's the other way around.
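As a rough sketch of that workflow (MySQL-flavoured, with made-up tables and columns), the first step is usually an EXPLAIN, and the first fix is usually an index on the join/filter columns:

```sql
-- Inspect the plan for the slow query
EXPLAIN
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'DE'
  AND o.created_at >= '2021-01-01';

-- If the plan shows a full scan on orders, an index on the join and
-- filter columns is often the first thing to try.
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at);
```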
If I can't get satisfactory performance using a single query, I may break it up into multiple queries and/or do some of the work in code. Such code tends to be more complicated and longer than an equivalent query, so I try to avoid it unless necessary.

When a query is executed, what happens first in the back-end?

I have a query in Cognos which fetches a huge volume of data. Since the execution time is high, I'd like to fine-tune my query. Everyone says that the WHERE clause in a query gets executed first.
My question is: which happens first when a query is executed?
Is the JOIN in the query established first, or is the WHERE clause executed first?
If the JOIN is established first, I should specify the filters on the DIMENSION first; otherwise I should specify the filters on the FACT first.
Please explain.
Thanks in advance.
The idea of SQL is that it is a high level declarative language, meaning you tell it what results you want rather than how to get them. There are exceptions to this in various SQL implementations such as hints in Oracle to use a specific index etc, but as a general rule this holds true.
Behind the scenes the optimiser for your RDBMS implements relational algebra to do a cost based estimate of the different potential execution plans and select the one that it predicts will be the most efficient. The great thing about this is that you do not need to worry what order you write your where clauses in etc, so long as all of the information is there the optimiser should pick the most efficient plan.
That being said, there are often things that you can do on the database to improve query performance, such as building indexes on columns in large tables that are often used in filtering criteria or joins.
Another consideration is whether you can use parallel hints to speed up your run time but this will depend on your query, the execution plan that is being used, the RDBMS you are using and the hardware it is running on.
If you post the query syntax and what RDBMS you are using we can check if there is anything obvious that could be amended in this case.
The order of filters definitely does not matter. The optimizer will take care of that.
As for filtering on the fact or dimension table - do you mean you are exposing the same field in your Cognos model for each (ex ProductID from both fact and Product dimension)? If so, that is not recommended. Generally speaking, you should expose the dimension field only.
This is more of a question about your SQL environment, though. I would export the SQL generated by Cognos from within Report Studio (Tools -> Show Generated SQL). From there, hopefully you are able to work with a good DBA to see if there are any obvious missing indexes, etc in your tables.
There's not a whole lot of options within Cognos to change the SQL generation. The prior poster mentions hints, which could work if writing native SQL, but that is a concept not known to Cognos. Really, all you can do is change the implicit/explicit join syntax, which just controls whether the join happens in an ON clause or in the WHERE. Although the WHERE side is pretty ugly, it generally compiles the same as ON.
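For reference, here is what the two join styles look like on a made-up fact/dimension pair (table and column names are illustrative only); they typically compile to the same plan:

```sql
-- Implicit join: the join condition lives in the WHERE clause
SELECT f.sales_amount, d.product_name
FROM sales_fact f, product_dim d
WHERE f.product_key = d.product_key
  AND d.product_name = 'Widget';

-- Explicit join: the same condition expressed in an ON clause
SELECT f.sales_amount, d.product_name
FROM sales_fact f
JOIN product_dim d ON f.product_key = d.product_key
WHERE d.product_name = 'Widget';
```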

Joins / Sub queries dilemma

I come across several instances where I can write a query using either joins or subqueries. I usually use joins but sometimes use subqueries (without any particular reason). I have read in several places (including Stack Overflow) that joins are faster than subqueries in many instances, but sometimes subqueries are faster. Right now the queries I am writing do not deal with a large amount of data, so I guess speed isn't much of a concern. But for the future, I'm curious about the following.
a) Why are joins faster than subqueries (in general)?
b) What are the instances when subqueries are faster? How will I know?
c) If I'm writing a query, how should I judge whether I should use a subquery or a join? I would appreciate it if someone could explain with an example.
Saying that joins are 'mostly faster' than sub-queries is not true. This entirely depends on the DBMS used.
For Microsoft SQL Server I know that this is not true. Usually the performance is the same, not only in theory but also in practice.
For MySQL I have heard that sub-queries are problematic, but I don't have personal evidence.
Oracle seems to be about the same as SQL Server.
The answers to your questions.
a) Joins aren't faster than subqueries (in general). But DBMSs often produce a much smarter execution plan if you use joins. This is related to the procedure by which queries are transformed into execution plans.
b) and c) In general there are no rules for writing fast queries. Furthermore, there is only one way to choose the correct query for your task: you have to benchmark the different versions. So if you have to decide how to formulate a certain query, benchmark the first version; if it performs well, stop. Otherwise change something and benchmark it again, and if it is fine, stop. Use an environment that is close to your production environment: use realistic datasets (a query might perform well with thousands of records but not with millions) and use the same hardware as in production. Consider benchmarking the query in the context of your application, since its other queries may influence the performance.
The main reason, from the research I've done, is that the optimizer more directly utilizes the proper indexes when you explicitly state how to do the join (i.e. LEFT JOIN, INNER JOIN, etc.). If you use a sub-query, you leave it up to the optimizer, and it does not always choose the fastest way (ironic, given that it's called an 'optimizer').
Anyway, it may be easier to write your sub-query, but if you are building a query for speed and long-term use, it's clear that you should write out the explicit joins.
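For illustration, here is a common pair of equivalent forms (assuming a hypothetical customers table where id is the primary key, so the join cannot produce duplicates):

```sql
-- Subquery form: easy to write and read top-down
SELECT *
FROM orders
WHERE customer_id IN (SELECT id
                      FROM customers
                      WHERE country = 'DE');

-- Join form: same result, often planned better (by older MySQL versions
-- in particular)
SELECT o.*
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'DE';
```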
Here are some links with some views and examples:
Join vs. Subquery
Another link - this one gives some details on why joins are faster (in most cases) than sub-queries.
more examples

temporary tables in SQL

I need to know whether it is standard practice to decompose complex queries into parts and create temporary tables which are dropped at the end. In OLAP applications it shouldn't be much of an issue, but in OLTP, since speed matters, is it avoided?
For simple queries which are well-optimized by your DBMS, temporary tables are usually a bad idea because they introduce overhead.
But sometimes your DBMS will have a really hard time optimizing complex queries. At that point you have at least 5 options:
change your schema or indexes to make it easier for the optimizer to choose a better query plan
tweak your SQL to get the DBMS to choose the indexes, join strategies, etc. that you want and to work around known and unknown bugs in your DBMS's optimizer.
use "hints" to get the DBMS to choose the indexes, join strategies, etc. that you want.
Get the plan you want and use a "saved plan" to force its use by the DBMS.
use temp tables (or table variables, etc.) to decompose complex queries into simpler intermediate queries
There's no hard-and-fast rule about which option is best for any particular query. I've used all of the above strategies. I tend to choose the temp table approach when I don't own the schema, so I can't change it, and when I don't want to depend on hints or query tuning or saved plans (often because I don't want to expose myself to changes in the underlying schema made later).
Keep in mind that using temp tables to decompose queries will give you sub-optimal performance every time. But it's usually predictably sub-optimal. The worst case using temp tables isn't nearly as bad as when your DBMS chooses a bad plan for a single large query. This happens surprisingly often, especially in the face of changes in underlying schema, DBMS version changes, dev vs. production differences, etc.
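A minimal MySQL-style sketch of that decomposition (the table and column names are hypothetical):

```sql
-- Materialize the hard-to-optimize intermediate result first
CREATE TEMPORARY TABLE tmp_active_customers AS
SELECT c.id, c.region
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at >= '2021-01-01'
GROUP BY c.id, c.region;

-- The final query then runs against a small, predictable intermediate set
SELECT region, COUNT(*) AS active_customers
FROM tmp_active_customers
GROUP BY region;

DROP TEMPORARY TABLE tmp_active_customers;
```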
Personally, I find that if a query gets to a level of complexity where I have to bend over backwards to get the DBMS to do what I want, and if I feel that maintainability of the application is at risk, then I'll often go with decomposition and temp tables if I can't change the schema or indexes.
Of course, in theory you shouldn't be running expensive, complex queries on your OLTP database, but in practice most applications are never "pure" OLTP-- there's always a few complicated, hard-to-optimize queries in any OLTP project.
The critical word in your question is "decompose". Temp tables and other strategies are generally discouraged and found to lead to lower overall performance. The optimizer is perfectly capable of using intermediate tables if they are useful for getting to the answer most quickly. Very rarely can you help the optimizer by coercing it with your own strategy.
The same thing goes for suggesting which indexes to use.
When you see this going on, almost always some one has more work to do refining their query statements.
The only time I've used temp tables during OLTP processing is when I am dealing with a batch of data that I need to analyze/join, and eventually do a data change operation on it (Insert/Update/Delete). I'll use temp tables for a) speed but more importantly, b) because the normal select/update or select/delete logic is either too complex or can't be done in one transactional statement.
For example, find 100k users who meet some condition, and insert them into an archive table and then delete them.
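A MySQL-style sketch of that kind of batch (the users/users_archive tables and the cutoff are made up): snapshot the target IDs once, then reuse that set for both the archive insert and the delete.

```sql
-- Snapshot up to 100k users matching the condition
CREATE TEMPORARY TABLE tmp_users_to_archive AS
SELECT id
FROM users
WHERE last_login < '2010-01-01'
LIMIT 100000;

-- Copy the selected users into the archive table
INSERT INTO users_archive
SELECT u.*
FROM users u
JOIN tmp_users_to_archive t ON t.id = u.id;

-- Delete the same set from the live table
DELETE u
FROM users u
JOIN tmp_users_to_archive t ON t.id = u.id;

DROP TEMPORARY TABLE tmp_users_to_archive;
```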
I don't recommend using temp tables in most cases for normal select statements. You can almost always get better performance with either proper indexing, better sql join/hints and/or changing the data structure to match data access paths.
In OLTP systems, if the processing is part of the online system (i.e. not batch), then I can't recall ever using a temporary table. Using some sort of procedural logic is usually the way to go - e.g. PL/SQL in Oracle and so on.
In OLAP, temporary tables are very common: usually you load the data into a table, transform it, and save the result in another table, and depending on the processing there may be a number of transform steps.
I'd go so far as to say that if you have an OLTP system and you need to use a temporary table, then something is incorrect - modify your design or use procedural logic. In OLAP, temporary tables are very common.
hth