Query complexity vs code complexity - SQL

So my question is rather simple: what's better and faster, doing calculations in code (let's say Java) or just doing complex database queries (assuming the same action can be done both ways)? Which approach is better in general, and why?

I'd do it in the code.
Doing business calculations in queries in the DB spreads the logic of the app around and makes it harder to understand, plus you often get bound to a specific storage engine (i.e. SQL Server/Oracle/MySQL/etc.), losing the possibility to change the storage paradigm later (e.g. to a NoSQL DB).
In code you can also apply dependency injection to easily change the behavior, making it more manageable.

I generally find it faster (in development time) to write a query to do what I need. The first round focuses on logical correctness, and I'll complete the rest of the functionality using that query. I try to avoid doing queries in loops in this phase, but otherwise I don't worry too much about performance.
Then, I'll look at the performance of the whole feature. If the query is too slow, I'll EXPLAIN it and look at the server's statistics. My first focus is on indexes; if that doesn't work, I'll try restructuring the query. Sometimes correlated subqueries are faster than joins, sometimes unions are faster than disjunctions in the WHERE clause, sometimes it's the other way around.
If I can't get satisfactory performance using a single query, I may break it up into multiple queries and/or do some of the work in code. Such code tends to be more complicated and longer than an equivalent query, so I try to avoid it unless necessary.
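A minimal sketch of that tuning loop, assuming MySQL syntax and a hypothetical orders table:

-- Inspect how the slow query is executed
EXPLAIN
SELECT customer_id, SUM(total) AS revenue
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;

-- If the plan shows a full table scan on order_date, try an index first
CREATE INDEX idx_orders_order_date ON orders (order_date);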

Related

BigQuery - how to think about query optimisation when coming from a SQL Server background

I have a background that includes SQL Server and Informix database query optimisation (non big-data). I'm confident in how to maximise database performance on those systems. I've recently been working with BigQuery and big data (about 9+ months), and optimisation doesn't seem to work the same way. I've done some research and read some articles on optimisation, but I still need to better understand the basics of how to optimise on BigQuery.
In SQL Server/Informix, a lot of the time I would introduce a column index to speed up reads. BigQuery doesn't have indexes, so I've mainly been using clustering. When I've benchmarked after introducing clustering on a column that I thought should make a difference, I didn't see any significant change. I'm also not seeing a difference when I switch on query caching. This could be an unfortunate coincidence with the queries I've tried, or a mistaken perception; however, with SQL Server/SQLite/Informix I'm used to consistently seeing an immediate, significant improvement. Am I misunderstanding clustering (I know it's not exactly like an index, but I'm expecting it to work in a similar way), or could it just be that I've somehow been 'unlucky' with the optimisations?
And this is where the real point is. There's almost no such thing as being 'unlucky' with optimisation, but in a traditional RDBMS I would look at the execution plan and know exactly what I need to do to optimise, and find out exactly what's going on. With BigQuery, I can get the 'execution details', but it really isn't telling me much (at least that I can understand) about how to optimise, or how the query really breaks down.
Do I need a significantly different way of thinking about BigQuery? Or does it work in similar ways to an RDBMS, where I can consciously make the first JOINs eliminate as many records as possible, use WHERE clauses that focus on indexed columns, etc.?
I feel I haven't got the control to optimise like in an RDBMS, but I'm sure I'm missing a major point (or a few points!). What are the major strategies I should be looking at for BigQuery optimisation, and how can I understand exactly what's going on with queries? If anyone has any links to good documentation that would be fantastic - I'm yet to read something that makes me think "Aha, now I get it!".
It is absolutely a paradigm shift in how you think. You're right: you hardly have any control over execution. And you'll eventually come to appreciate that. You do have control over architecture, and that's where a lot of your wins will be. (As others mentioned in comments, the documentation is definitely helpful too.)
I've personally found that premature optimization is one of the biggest issues in BigQuery: often the things you do trying to make a query faster actually have a negative impact, because things like table scans are already well optimized, and there are internals you can inadvertently work against (for example, restructuring a query in a way that seems more optimal but that forces additional shuffles to disk for parallelization).
Some of the biggest areas where our team HAS seen greatly improved performance are as follows:
Use a semi-normalized (nested/repeated) schema when possible. By using nested STRUCT/ARRAY types in your schema, you ensure that the data is colocated with the parent record. You can basically think of these as tables within tables. The use of CROSS JOIN UNNEST() takes a little getting used to (there's a sketch after this list), but eliminating those joins makes a big difference (especially on large results).
Use partitioning/clustering on large datasets when possible. I know you mention this; just make sure that you're pruning what you can using _PARTITIONTIME when possible, and also using clustering keys that make sense for your data (see the sketches after this list). Keep in mind that clustering basically sorts the storage order of the data, meaning that the optimizer knows it doesn't have to continue scanning once the criteria have been satisfied (so it doesn't help as much on low-cardinality values).
Use analytic window functions when possible. They're very well optimized, and you'll find that BigQuery's implementation is very mature (there's a sketch of this after the list too). Often you can eliminate grouping this way, or filter out more of your data earlier in the process. Keep in mind that sometimes filtering data in derived tables or Common Table Expressions (CTEs/named WITH queries) earlier in the process can make a more deeply nested query perform better than trying to do everything in one flat layer.
Keep in mind that results for Views and Common Table Expressions (CTEs/named WITH queries) aren't materialized during execution. If you use a CTE multiple times, it will be executed multiple times. If you join the same View multiple times, it will be executed multiple times. This was hard for members of our team who came from the world of materialized views (although it looks like something's in the works for that in the BQ world, since there's an unused materializedView property showing in the API).
Know how the query cache works. Unlike some platforms, the cache only stores the output of the outermost query, not its component parts. Because of this, only an identical query against unmodified tables/views will use the cache—and it will typically only persist for 24 hours. Note that if you use non-deterministic functions like NOW() and a host of other things, the results are non-cacheable. See details under the Limitations and Exceptions sections of the docs.
Materialize your own copies of expensive tables. We do this a lot, and use scheduled queries and scripts (API and CLI) to normalize and save a native table copy of our data. This allows very efficient processing and fast responses from our client dashboards as well as our own reporting queries. It's a pain, but it works well.
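A minimal sketch of the nested/repeated approach, in BigQuery Standard SQL; the dataset, table, and field names here are invented for illustration:

-- orders has a repeated STRUCT column "items", so no join to a separate items table is needed
SELECT
  o.order_id,
  item.sku,
  item.quantity
FROM `my_project.my_dataset.orders` AS o
CROSS JOIN UNNEST(o.items) AS item
WHERE item.quantity > 1;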
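A sketch of partition pruning plus a clustering-friendly filter, again with hypothetical names; it assumes an ingestion-time-partitioned table clustered on customer_id:

-- _PARTITIONTIME prunes whole partitions; the customer_id predicate benefits from clustering
SELECT
  customer_id,
  SUM(amount) AS total_amount
FROM `my_project.my_dataset.events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-01-31')
  AND customer_id = 'C-1234'
GROUP BY customer_id;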
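And a sketch of an analytic window function replacing a self-join/GROUP BY when picking the latest row per key (hypothetical table and columns):

-- Latest event per user without joining the table back to itself
SELECT * EXCEPT(rn)
FROM (
  SELECT
    e.*,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
  FROM `my_project.my_dataset.user_events` AS e
)
WHERE rn = 1;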
Hopefully that will give you some ideas, but also feel free to post queries on SO in the future that you're having a hard time optimizing. Folks around here are pretty helpful when you let them know what your data looks like and what you've already tried.
Good luck!

Subselect vs Slicer in MDX

If I want my results to be evaluated in the context of any tuple in MDX, but don't want this tuple to be part of the results, I use either of the below two options.
1. SUBSELECT
SELECT [Measures].[SomeMeasure] ON 0,
[DimName].[HierName].children ON 1
FROM
(SELECT foo.bar.&[Val] ON 0 FROM
[MyCube])
2. SLICER
SELECT [Measures].[SomeMeasure] ON 0,
[DimName].[HierName].children ON 1
FROM
[MyCube]
WHERE (foo.bar.&[Val])
A third option that came to my mind is the EXISTS clause, but I soon realized that it is meant for something else altogether.
So, other aspects aside, I am interested in the general performance of these queries, any benchmarks or best practices to be kept in mind and which one to go for in which circumstances.
As mostly with optimizer questions, the answer is: It depends. I would say WHERE is faster in many situations, but there are cases where subselect is faster.
Optimizers are normally not documented in every detail by vendors (even if some are more documented than others, and Analysis Services is a typical example of an engine with a less documented optimizer). I would think they have many, many rules in their code like "if this and that, but not a third condition, then go along that route". And this changes constantly, hence any documentation would be outdated with more or less each hotfix.
As said, the situation is a bit better for many relational engines, where for SQL Server, you can at least show a plan that is more or less understandable. But even there you do not know why exactly the optimizer chose this plan and not another, and sometimes have to try several approaches to get the optimizer on another path (like using an index, ...). And a new release of SQL Server may handle things differently, hopefully better in most cases, but possibly worse in a few rare cases.
That is clearly not a clean, documented way of writing code either, but just trial and error.
In summary: You will have to test with your cube and your typical queries.
Anyway, in many cases, the performance difference is so small that it is not relevant.
Finally, the best documentation that is available for the Analysis Services optimizer is the old blog of one of the Analysis Services query engine developers at http://sqlblog.com/blogs/mosha/default.aspx. This being a blog, it is not very systematic, but just a collection of some random samples of optimizer behavior with the reasons behind it.
As far as I know, if you want to cache the results of your queries and improve overall throughput, then the slicer is better; but if you just care about single-query performance, then you can get better performance with a subselect.
Answering the question below, the following information is from Chris Webb:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/c1fe120b-256c-425e-93a5-24278b2ab1f3/subselect-or-where-slice?forum=sqlanalysisservices
First of all, it needs to be said that subselects and the Where clause do two different things - they're not interchangeable in all circumstances, they may return different results, and sometimes one performs better, sometimes the other does, because they may result in different query plans being generated. One technique is not consistently 'better' than the other on all cubes, and any differences in performance may well change from service pack to service pack.
To answer the original question: use whichever you find gives you the best query performance and returns the results you want to see. In general I prefer the Where clause (which is not deprecated); the reason is that while a subselect may perform faster initially in some cases, it limits Analysis Services' ability to cache the result of calculations which means slower performance in the long term:
http://cwebbbi.spaces.live.com/blog/cns!7B84B0F2C239489A!3057.entry

Avoiding Nested Queries

How important is it to avoid nested queries?
I have always learnt to avoid them like the plague, but they are the most natural thing to me. When I am designing a query, the first thing I write is a nested query. Then I convert it to joins, which sometimes takes a lot of time to get right, and rarely gives a big performance improvement (sometimes it does).
So are they really so bad? Is there a way to use nested queries without temp tables and filesort?
It really depends; I have had situations where I improved some queries by using subqueries.
The factors that I am aware of are:
whether the subquery uses fields from the outer query for comparison or not (correlated or not)
whether the relation between the outer query and the subquery is covered by indexes
if there are no usable indexes on the joins, and the subquery is not correlated and returns a small result, it might be faster to use it
I have also run into situations where transforming a query that uses ORDER BY into one that does not, and then turning it into a simple subquery and sorting the result, improved performance in MySQL
Anyway, it is always good to test different variants (with SQL_NO_CACHE, please), and turning correlated queries into joins is a good practice.
I would even go so far as to call it a very useful practice.
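A hedged sketch of that kind of rewrite (MySQL syntax; the products table and its columns are hypothetical), turning a correlated subquery into a join against a derived table:

-- Correlated version: the inner query is logically evaluated once per outer row
SELECT p.product_id, p.price
FROM products AS p
WHERE p.price > (SELECT AVG(p2.price)
                 FROM products AS p2
                 WHERE p2.category_id = p.category_id);

-- Join version: compute the per-category averages once, then join
SELECT p.product_id, p.price
FROM products AS p
JOIN (SELECT category_id, AVG(price) AS avg_price
      FROM products
      GROUP BY category_id) AS a
  ON a.category_id = p.category_id
WHERE p.price > a.avg_price;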
It might be possible that, if correlated queries are the first that come to your mind, you are not primarily thinking in terms of set operations, but in terms of procedural operations; when dealing with relational databases it is very useful to fully adopt the set perspective on the data model and on transformations of it.
EDIT:
Procedural vs Relational
Thinking in terms of set operations vs procedural operations boils down to equivalences between set algebra expressions; for example, a selection on a union is equivalent to a union of selections. There is no difference between the two.
But when you compare the two procedures, such as applying the selection criteria to every element of a union versus making the union and then applying the selection, they are distinctly different procedures, which might have very different properties (for example in utilization of CPU, I/O, and memory).
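As a small illustration of that equivalence in SQL (the yearly sales tables are hypothetical):

-- Selection applied after the union...
SELECT *
FROM (
  SELECT * FROM sales_2023
  UNION ALL
  SELECT * FROM sales_2024
) AS s
WHERE s.region = 'EMEA';

-- ...returns the same rows as the union of the selections
SELECT * FROM sales_2023 WHERE region = 'EMEA'
UNION ALL
SELECT * FROM sales_2024 WHERE region = 'EMEA';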
The idea behind relational databases is that you do not describe how to get the result (the procedure), but only what you want, and the database management system decides on the best path (procedure) to fulfil your request. This is why SQL is called a 4th generation language (4GL).
One of the tricks that help you do that is to remind yourself that tuples have no inherent order (set elements are unordered).
Another is realizing that relational algebra is quite comprehensive and allows translation of requests (requirements) directly to SQL (if the semantics of your model represent the problem space well; in other words, if the meaning attached to the names of your tables and relationships is right, or in other words if your database is designed well).
Therefore, you do not have to think how, only what.
In your case, it was just a preference for correlated queries, so it might be that I am not telling you anything new, but you emphasized that point, hence the comment.
I think that if you were completely comfortable with all the rules that transform queries from one form into another (rules such as distributivity), you would not prefer correlated subqueries (you would see all forms as equal).
(Note: the above describes the theoretical background, which is important for database design; in practice these concepts deviate: not all equivalent rewrites of a query are necessarily executed equally fast, clustered primary keys do make tables have an inherent order on disk, etc. But these are only deviations; the fact that not all equivalent queries execute equally fast is an imperfection of the actual DBMS, not of the concepts behind it.)
Personally I prefer to avoid nested queries until they are necessary for the simple reason that nested queries can make the code less human readable and make debugging and collaboration more painful. I think nesting is acceptable if the nested query is something trivial or if temporary storage of large tables becomes an issue. But too many times I've seen complex nested queries within nested queries and it makes debugging painful.
I'm not sure what it looks like in MySQL 5.1 or 5.5, but in 5.0.x nested queries usually have horrible performance, because MySQL performs the subquery for each row fetched from the main query.
This probably isn't the case for more mature databases like MS SQL Server, which can internally rewrite nested queries to joins, but I've never used MS SQL Server so I don't know for sure.
http://dev.mysql.com/doc/refman/5.0/en/rewriting-subqueries.html
"It is also true that on some occasions, it is not only possible to rewrite a query without a subquery, but it can be more efficient to make use of some of these techniques rather than to use subqueries." - which is a rather funny statement, considering that for me, so far, every subquery has made the database crawl.
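The kind of rewrite that manual page is talking about looks roughly like this (MySQL syntax; the customers/orders tables are invented for the example):

-- Subquery version, which old MySQL versions often executed once per outer row
SELECT c.customer_id, c.name
FROM customers AS c
WHERE c.customer_id IN (SELECT o.customer_id
                        FROM orders AS o
                        WHERE o.total > 100);

-- Join version of the same question
SELECT DISTINCT c.customer_id, c.name
FROM customers AS c
JOIN orders AS o
  ON o.customer_id = c.customer_id
WHERE o.total > 100;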
Subqueries vs joins
I try to avoid nested queries because they're less readable.
But I do agree that they are easier to write. I mean, it's just easier to conceptualize when writing the code, IMO. But then, the deep nesting just makes reading the code very difficult. Make sure you add a comment telling the reader what the subquery is doing, so they don't need to read it unless they have to.
Also, as soon as it starts getting difficult to read, you might want to consider converting the subquery into a Common Table Expression. The conversion is easy to do, and it also makes the query much easier to read, since each CTE has a specific purpose.
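A small sketch of that conversion (standard SQL; the employees table and its columns are made up):

-- Nested version
SELECT d.department_id, d.avg_salary
FROM (SELECT department_id, AVG(salary) AS avg_salary
      FROM employees
      GROUP BY department_id) AS d
WHERE d.avg_salary > 50000;

-- Same query with the subquery pulled out as a named CTE
WITH department_averages AS (
  SELECT department_id, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY department_id
)
SELECT department_id, avg_salary
FROM department_averages
WHERE avg_salary > 50000;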

Is premature optimization in SQL as "evil" as it is in procedural programming languages?

I'm learning SQL at the moment and I've read that joins and subqueries can potentially be performance destroyers. I (somewhat) know the theory about algorithmic complexity in procedural programming languages and try to be mindful of that when programming, but I don't know how expensive different SQL queries can be. I'm deciding whether I should invest time in learning about SQL performance or just notice it when my queries run slow. The base question for me then is: is premature optimization for SQL as evil as it is for procedural languages?
As added information, I work in an environment where, most of the time, high performance is not an issue and the biggest tables I have to work with have some 150k rows.
Here's the Donald Knuth quote I refer to when saying "evil":
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
I would say that some general notions about performance are a must-have: they'll prevent you from writing really bad queries that can hurt your application (even if you don't have millions of rows in your tables).
It'll also help you design your database so it's more efficiency-oriented: you'll have some idea about where to put indexes, for instance.
But you shouldn't have performance as a first goal: the first thing is to have an application that works; then, if needed, you'll optimize it (having some performance notions while developing will help you build an application that's easier to optimize, though).
Note that I would not say that "having notions about performance" is "premature optimization", as long as you don't just "optimize" but simply "write correctly"; I would rather call it good practice that will help you write better quality code ;-)
What Knuth means is: it's really, really important to know about SQL optimisation but only when you need to. As you say, "most of the time ... high performance is not an issue."
It's that 3% of times when you do need high performance that it's important to know what rules to break and why.
However, unlike procedural languages, even at 150k rows it can be important to know a little about how your query is processed. For instance, free-text searching will be very slow compared with exact matches on indexed columns. It's at the final steps, e.g. sharding or full denormalisation, where most DBAs and developers draw the line.
I wouldn't say that SQL optimization has as many pitfalls as premature programming optimization. Designing your schema and queries ahead of time with performance in mind can help you avoid some really nasty redesigns later on. That being said, spending a day getting rid of a table scan can be utterly worthless to you in the long run if that query isn't a slow query, can be cached, or is rarely called in a manner that would impact your application.
I personally profile my queries and focus on the worst, and most used queries. Careful design ahead of time cuts out most of the worst.
I would say that you should make the SQL as easily readable as possible, and only worry about the performance once it hits you.
That said.
Be mindful of standard things as you develop, such as indexes, subselects, use of cursors where a standard query would do the job, etc.
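As an illustration of the cursor point, here is a sketch in T-SQL (the orders table, status column, and date cutoff are hypothetical); the single set-based statement does in one pass what the cursor does row by row:

-- Set-based version (preferred)
UPDATE orders
SET    status = 'archived'
WHERE  order_date < '2020-01-01';

-- Row-by-row cursor version, shown only as the pattern to avoid
DECLARE @id INT;
DECLARE order_cursor CURSOR FOR
    SELECT order_id FROM orders WHERE order_date < '2020-01-01';
OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE orders SET status = 'archived' WHERE order_id = @id;
    FETCH NEXT FROM order_cursor INTO @id;
END;
CLOSE order_cursor;
DEALLOCATE order_cursor;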
It will not hurt to develop the original correctly, and you can optimize the problems later when it is needed.
EDIT
Also remember that the maintainability of your SQL code is very important, and that debugging SQL is slightly more difficult than normal coding.
Knuth says "forget about 97%" but for a typical web app it's in the database IO where 97% of the request time is spent. This is where a little optimization effort can yield greatest results.
If this is the kind of app you're writing, I strongly suggest learning as much about how RDBMSes work as you can afford. Other people have given you excellent suggestions, and I'd add that I usually follow this list top-down when deciding how to spend my "optimization budget":
1. Schema design. Think twelve times about normalization and access strategies. This will save you many painful hours later.
2. Query readability. Related to #1: sometimes trying to reorganize your queries gives a better understanding of how the schema should look. It'll also help later when you ask for help.
3. Avoid subqueries in the SELECT list; use JOINs (there's a sketch after this list).
4. If there are slow queries, reach for the profiler. Check for missing indexes first. And finally, if there are still slow queries, try to rewrite them.
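As a rough illustration of item 3 (standard SQL; the customers/orders tables are made up):

-- Subquery in the SELECT list: evaluated per customer row
SELECT c.customer_id,
       (SELECT COUNT(*) FROM orders AS o WHERE o.customer_id = c.customer_id) AS order_count
FROM customers AS c;

-- JOIN version of the same result
SELECT c.customer_id,
       COUNT(o.order_id) AS order_count
FROM customers AS c
LEFT JOIN orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id;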
Keep in mind also that database performance very much depends on data distribution and the number of simultaneous requests (because of locking). Even though a query completes in 1 second on your underpowered netbook, it could take 15 seconds on an 8-core server. If possible, check your queries on actual data. If you know that concurrency is going to be high, it's (paradoxically) better to use many small queries than one big one.
I agree with everything that's said here, and I'd like to add: make sure that your SQL is well-encapsulated so that, when you discover what needs to be optimized, there's only one place you need to change it, and the change will be transparent to whatever code calls it.
Personally, I like to encapsulate all of my SQL in PL/SQL procedures, but there are some who disagree with that. Whatever you do, I recommend trying to avoid putting your SQL "inline" with other source code. That seems to always lead to cut-and-pasting and quickly becomes hard to maintain. Put your SQL elsewhere, and try to re-use it as much as possible.
Also, read up on indexes: how they really work, and when you should and shouldn't use them. A lot of people's first instinct when they get a slow query is to index the table to death. That might solve the problem in the short term, but in the long term an over-indexed table will be slow to insert into and update. A few well-chosen indexes are much better than indexing every field. Try reading "Refactoring SQL Applications" by Stephane Faroult.
Finally, as said above, a properly normalized database design will help you avoid 99% of your slow queries. Denormalization is necessary sometimes, but it's important to know the rules before you break them.
Good luck!

Is there such a thing as a query being too big?

I don't have too much experience with SQL. Most of the queries I have written have been very small. Whenever I see a very large query, I always kind of assume it needs to be optimized. But is this true? Or are there situations where really large queries are just what's needed?
BTW, when I say large queries I mean queries that exceed 1000+ characters.
Yes, any statement, method, or even query can be "too big".
The problem is actually defining what "too big" really is.
If you can't sit down and figure out what the query does in a relatively short amount of time, it's probably best to break it up into smaller chunks.
I always like to look at things from a maintenance standpoint. If the query is hard to understand now, what if you have to debug something in it?
Just because you see a large query, doesn't mean it needs to be changed or optimized, but if it's too complicated for its own good, then you might want to consider refactoring.
Just as in other languages, you can't determine the efficiency of a query based on a character count. Also, 1000 characters isn't what I would call "large", especially when you use good table/column names, aliases that make sense, etc.
If you're not comfortable enough with SQL to be able to eyeball the design merits of a particular query, run it through a profiler and examine the execution plan. That'll give you a good idea of the problems, if any, the code in question will suffer from.
My rule of thumb is this: write the best, tightest, simplest code you can, and optimize where needed - ie, where you see a performance bottleneck or where (as frequently happens) you slap yourself in the head and say "D'OH!" at about three in the morning on vacation.
Summary: Code well, and optimize where needed.
As Robert said, if you can't easily tell what the query is doing, it probably needs to be simplified.
If you are used to writing simple stuff, you may not realize how complex getting the information for a complex report can be. Yes, queries can get long and complicated and still perform well for what they are being asked to do. Often the techniques used to performance-tune something make the code look more complicated to those less familiar with advanced querying techniques. What counts is how long it takes to execute and whether it returns the correct data, not how many characters it has.
When I see a complex query, my first thought is does it return what the developer really intended to return (you'd be surprised at how often the answer to that is no) and then I look to see if it could be performance tuned. Yes there are many badly written long queries out there, but there are also as many or more that do what they are intended to do about as fast as it can be done without a major database redesign or faster hardware.
I'd suggest that it's not the characters that should measure the size/complexity of the query.
I'd boil it down to:
what's the goal of the query?
does it use set-based logic?
does it re-use any components?
does it JOIN improperly/poorly?
what are the performance implications?
maintainability concerns - is it written so that another developer can grok its intentions?
Where I work we've created stored procedures that exceed 1000 characters. I can't really say it was NECESSARY, but sometimes haste wins out over efficiency (most notably when a quick fix is necessary for a client).
Having said that... if given the time, I would attempt to optimize a query to be as small and efficient as it can get without being overly confusing. I've used nested stored procedures, and functions as well, to make things a little clearer.
The number of characters does not mean that a query needs to be optimized - it is what you see within those characters that does.
Things like subqueries on top of subqueries are something I would review. I'd review JOINs as well, but it shouldn't take long, comparing against the ERD, to know whether there's an unnecessary JOIN - the first thing I'd look at would be tables that are joined but not used in the output, which is fine if those are link/corollary/etc. tables.