is there such thing as a query being too big? - sql

I don't have too much experience with SQL. Most of the queries I have written have been very small. Whenever I see a very large query, I always kinda assume it needs to be optimized. But is this true? or is there situations where really large queries are just whats needed?
BTW when I say large queries I mean queries that exceed 1000+ chars

Yes, any statement, method, or even query can be "too big".
The problem, is actually defining what too big really is.
If you can't sit down and figure out what the query does in a relatively short amount of time, it's probably best to break it up into smaller chunks.
I always like to look at things from a maintenance standpoint. If the query is hard to understand now, what if you have to debug something in it?
Just because you see a large query, doesn't mean it needs to be changed or optimized, but if it's too complicated for its own good, then you might want to consider refactoring.

Just as in other languages, you can't determine the efficiency of a query based on a character count. Also, 1000 characters isn't what I could call "large", especially when you use good table/column names, aliases that make sense, etc.
If you're not comfortable enough with SQL to be able to "eye ball" the design merits of particular query, run it through a profiler and examine the execution plan. That'll give you a good idea of problems, if any, the code in question will suffer from.
My rule of thumb is this: write the best, tightest, simplest code you can, and optimize where needed - ie, where you see a performance bottleneck or where (as frequently happens) you slap yourself in the head and say "D'OH!" at about three in the morning on vacation.
Summary:Code well, and optimize where needed.
As Robert said, if you can't easily tell what the query is doing, it probably needs to be simplified.

If you are used to writing simple stuff, you may not realize how complex getting information for a complex report might be. Yes, queries can get long and complicated and still perform well for what they are being asked to do. Often the techniques that are used to performance tune something may make the code look more complicated to those less familar with advanced querying techniques. What counts is how long it takes to execute and whether it returns the correct data, not how many characters it has.
When I see a complex query, my first thought is does it return what the developer really intended to return (you'd be surprised at how often the answer to that is no) and then I look to see if it could be performance tuned. Yes there are many badly written long queries out there, but there are also as many or more that do what they are intended to do about as fast as it can be done without a major database redesign or faster hardware.

I'd suggest that it's not the characters that should measure the size/complexity of the query.
I'd boil it down to:
what's the goal of the query?
does it used set-based logic?
does it re-use any components?
does it JOIN improperly/poorly?
what are the performance implications?
maintainability concerns - is it written so that another developer can grok its intentions?

Where I work we've created stored procedures that exceed 1000 characters. I can't really say it was NECESSARY but sometimes haste wins out over efficiency (most notably when a quick fix is necessary for a client).
Having said that ... if given the time I would attempt to optimize a query as small/efficient as it can get without it being overly confusing. I've used nested stored procedures to make things a little more clear and/or functions as well.

The number of characters does not mean that a query needs to be optimized - it is what you see within those characters that does.
Things like subqueries on top of subqueries is something I would review. I'd review JOINs as well, but it shouldn't take long comparing to the ERD to know if there's an unnecessary JOIN - the first thing I'd look at would be what tables are joined but not used in the output, which is fine if the tables are link/corrollary/etc tables.

Related

BigQuery - how to think about query optimisation when coming from a SQL Server background

I have a background that includes SQL Server and Informix database query optimisation (non big-data). I'm confident in how to maximise database performance on those systems. I've recently been working with BigQuery and big data (about 9+ months), and optimisation doesn't seem to work the same way. I've done some research and read some articles on optimisation, but I still need to better understand the basics of how to optimise on BigQuery.
In SQL Server/Informix, a lot of the time I would introduce a column index to speed up reads. BigQuery doesn't have indexes, so I've mainly been using clustering. When I've done benchmarking after introducing a cluster for a column that I thought should make a difference, I didn't see any significant change. I'm also not seeing a difference when I switch on query cacheing. This could be an unfortunate coincidence with the queries I've tried, or a mistaken perception, however with SQL Server/SQL Lite/Informix I'm used to seeing immediate significant improvement, consistently. Am I misunderstanding clustering (I know it's not exactly like an index, but I'm expecting it should work in a similar type of way), or could it just be that I've somehow been 'unlucky' with the optimisations.
And this is where the real point is. There's almost no such thing as being 'unlucky' with optimisation, but in a traditional RDBMS I would look at the execution plan and know exactly what I need to do to optimise, and find out exactly what's going on. With BigQuery, I can get the 'execution details', but it really isn't telling me much (at least that I can understand) about how to optimise, or how the query really breaks down.
Do I need a significantly different way of thinking about BigQuery? Or does it work in similar ways to an RDBMS, where I can consciously make the first JOINS eliminate as many records as possible, use 'where' clauses that focus on indexed columns, etc. etc.
I feel I haven't got the control to optimise like in a RDBMS, but I'm sure I'm missing a major point (or a few points!). What are the major strategies I should be looking at for BigQuery optimisation, and how can I understand exactly what's going on with queries? If anyone has any links to good documentation that would be fantastic - I'm yet to read something that makes me think "Aha, now I get it!".
It is absolutely a paradigm shift in how you think. You're right: you don't have hardly any control in execution. And you'll eventually come to appreciate that. You do have control over architecture, and that's where a lot of your wins will be. (As others mentioned in comments, the documentation is definitely helpful too.)
I've personally found that premature optimization is one of the biggest issues in BigQuery—often the things you do trying to make a query faster actually have a negative impact, because things like table scans are well optimized and there are internals that you can impact (like restructuring a query in a way that seems more optimal, but forces additional shuffles to disk for parallelization).
Some of the biggest areas our team HAS seem greatly improve performance are as follows:
Use semi-normalized (nested/repeated) schema when possible. By using nested STRUCT/ARRAY types in your schema, you ensure that the data is colocated with the parent record. You can basically think of these as tables within tables. The use of CROSS JOIN UNNEST() takes a little getting used to, but eliminating those joins makes a big difference (especially on large results).
Use partitioning/clustering on large datasets when possible. I know you mention this, just make sure that you're pruning what you can using _PARTITIONTIME when possible, and also using clutering keys that make sense for your data. Keep in mind that clustering basically sorts the storage order of the data, meaning that the optimizer knows it doesn't have to continue scanning if the criteria has been satisfied (so it doesn't help as much on low-cardinality values)
Use analytic window functions when possible. They're very well optimized, and you'll find that BigQuery's implementation is very mature. Often you can eliminate grouping this way, or filter our more of your data earlier in the process. Keep in mind that sometimes filtering data in derived tables or Common Table Expressions (CTEs/named WITH queries) earlier in the process can make a more deeply nested query perform better than trying to do everything in one flat layer.
Keep in mind that results for Views and Common Table Expressions (CTEs/named WITH queries) aren't materialized during execution. If you use the CTE multiple times, it will be executed multiple times. If you join the same View multiple times, it will be executed multiple times. This was hard for members of our team who came from the world of materialized views (although it looks like somethings in the works for that in BQ world since there's an unused materializedView property showing in the API).
Know how the query cache works. Unlike some platforms, the cache only stores the output of the outermost query, not its component parts. Because of this, only an identical query against unmodified tables/views will use the cache—and it will typically only persist for 24 hours. Note that if you use non-deterministic functions like NOW() and a host of other things, the results are non-cacheable. See details under the Limitations and Exceptions sections of the docs.
Materialize your own copies of expensive tables. We do this a lot, and use scheduled queries and scripts (API and CLI) to normalize and save a native table copy of our data. This allows very efficient processing and fast responses from our client dashboards as well as our own reporting queries. It's a pain, but it works well.
Hopefully that will give you some ideas, but also feel free to post queries on SO in the future that you're having a hard time optimizing. Folks around here are pretty helpful when you let them know what your data looks like and what you've already tried.
Good luck!

Query complexity vs code complexity

So my question is rather simple, what's better and faster, doing calculations in code(let's say java) or just doing complex database queries(if we assume we can do one same action in both ways)? Which approach is better in general, and why?
I'd do in the code.
Doing business calculations in the queries in the DB gets the logic of the app spread and not easly understandable, plus you often get bound to specific storage (i.e. SQL Server/Oracle/MySql/etc) leaving the possibility to chage storage paradigma (i.e. to a NoSQL DB).
Then in the code you can apply some injection to change easly the behavior of your code, making it more manageable.
I generally find it faster (in development time) to write a query to do what I need. The first round focuses on logical correctness, and I'll complete the rest of the functionality using that query. I try to avoid doing queries in loops in this phase, but otherwise I don't worry too much about performance.
Then, I'll look at the performance of the whole feature. If the query is too slow, I'll EXPLAIN it and look at the server's statistics. My first focus is on indexes, if that doesn't work I'll try restructuring the query. Sometimes correlated subqueries are faster than joins, sometimes unions are faster than disjunctions in the WHERE clause, sometimes it's the other way around.
If I can't get satisfactory performance using a single query, I may break it up into multiple queries and/or do some of the work in code. Such code tends to be more complicated and longer than an equivalent query, so I try to avoid it unless necessary.

For really complex reports, do people sometimes code in their language rather than in sql?

I have some pretty complex reports to write. Some of them... I'm not sure how I could write an sql query for just one of the values, let alone stuff them in a single query.
Is it common to just pull a crap load of data and figure it all via code instead? Or should I try and find a way to make all the reports rely on sql?
I have a very rich domain model. In fact, parts of code can be expanded on to calculate exactly what they want. The actual logic is not all that difficult to write - and it's nicer to work my domain model than with SQL. With SQL, writing the business logic, refactoring it, testing it and putting it version control is a royal pain because it's separate from your actual code.
For example, one the statistics they want is the % of how much they improved, especially in relation to other people in the same class, the same school, and compared to other schools. This requires some pretty detailed analysis of how they performed in the past to their latest information, as well as doing a calculation for the groups you are comparing against as a whole. I can't even imagine what the sql query would even look like.
The thing is, this % improvement is not a column in the database - it involves a big calculation in of itself by analyzing all the live data in real-time. There is no way to cache this data in a column as doing this calculation for every row it's needed every time the student does something is CRAZY.
I'm a little afraid about pulling out hundreds upon hundreds of records to get these numbers though. I may have to pull out that many just to figure out 1 value for 1 user... and if they want a report for all the users on a single screen, it's going to basically take analyzing the entire database. And that's just 1 column of values of many columns that they want on the report!
Basically, the report they want is a massive performance hog no matter what method I choose to write it.
Anyway, I'd like to ask you what kind of solutions you've used to these kind of a problems.
Sometimes a report can be generated by a single query. Sometimes some procedural code has to be written. And sometimes, even though a single query CAN be used, it's much better/faster/clearer to write a bit of procedural code.
Case in point - another developer at work wrote a report that used a single query. That query was amazing - turned a table sideways, did some amazing summation stuff - and may well have piped the output through hyperspace - truly a work of art. I couldn't have even conceived of doing something like that and learned a lot just from readying through it. It's only problem was that it took 45 minutes to run and brought the system to its knees in the process. I loved that query...but in the end...I admit it - I killed it. ((sob!)) I dismembered it with a chainsaw while humming "Highway To Hell"! I...I wrote a little procedural code to cover my tracks and...nobody noticed. I'd like to say I was sorry, but...in the end the job ran in 30 seconds. Oh, sure, it's easy enough to say "But performance matters, y'know"...but...I loved that query... ((sniffle...)) Anybody seen my chainsaw..? >;->
The point of the above is "Make Things As Simple As You Can, But No Simpler". If you find yourself with a query that covers three pages (I loved that query, but...) maybe it's trying to tell you something. A much simpler query and some procedural code may take up about the same space, page-wise, but could possibly be much easier to understand and maintain.
Share and enjoy.
Sounds like a challenging task you have ahead of you. I don't know all the details, but I think I would go at it from several directions:
Prioritize: You should try to negotiate with the "customer" and prioritize functionality. Chances are not everything is equally useful for them.
Manage expectations: If they have unrealistic expectations then tell them so in a nice way.
IMHO SQL is good in many respects, but it's not a brilliant programming language. So I'd rather just do calculations in the application rather than in the database.
I think I'd go for some delay in the system .. perhaps by caching calculated results for some minutes before recalculating. This is with a mind towards performance.
The short answer: for analysing large quantities of data, a SQL database is probably the best tool around.
However, that does not mean you should analyse this straight off your production database. I suggest you look into Datawarehousing.
For a one-off report, I'll write the code to produce it in whatever I can best reason about it in.
For a report that'll be generated more than once, I'll check on who is going to be producing it the next time. I'll still write the code in whatever I can best reason about it in, but I might add something to make it more attractive to use to that other person.
People usually use a third party report writing system rather than writing SQL. As an application developer, if you're spending a lot of time writing complex reports, I would severely question your manager's actions in NOT buying an off-the-shelf solution and letting less-skilled people build their own reports using some GUI.

Fastest way to become a MySQL expert?

I have been using MySQL for years, mainly on smaller projects until the last year or so. I'm not sure if it's the nature of the language or my lack of real tutorials that gives me the feeling of being unsure if what I'm writing is the proper way for optimization purposes and scaling purposes.
While self-taught in PHP I'm very sure of myself and the code I write, easily can compare it to others and so on.
With MySQL, I'm not sure whether (and in what cases) an INNER JOIN or LEFT JOIN should be used, nor am I aware of the large amount of functionality that it has. While I've written code for databases that handled tens of millions of records, I don't know if it's optimum. I often find that a small tweak will make a query take less than 1/10 of the original time... but how do I know that my current query isn't also slow?
I would like to become completely confident in this field in the ability to optimize databases and be scalable. Use is not a problem -- I use it on a daily basis in a number of different ways.
So, the question is, what's the path? Reading a book? Website/tutorials? Recommendations?
EXPLAIN is your friend for one. If you learn to use this tool, you should be able to optimize your queries very effectively.
Scan the the MySQL manual and read Paul DuBois' MySQL book.
Use EXPLAIN SELECT, SHOW VARIABLES, SHOW STATUS and SHOW PROCESSLIST.
Learn how the query optimizer works.
Optimize your table formats.
Maintain your tables (myisamchk, CHECK TABLE, OPTIMIZE TABLE).
Use MySQL extensions to get things done faster.
Write a MySQL UDF function if you notice that you would need some
function in many places.
Don't use GRANT on table level or column level if you don't really need
it.
http://dev.mysql.com/tech-resources/presentations/presentation-oscon2000-20000719/index.html
The only way to become an expert in something is experience and that usually takes time. And a good mentor(s) that are better than you to teach you what you are missing. The problem is you don't know what you don't know.
Research and experience - if you don't have the projects to warrant the research, make them. Make three tables with related data and make up scenarios.
E.g.
Make a table of movies their data
make a table of user
make a table of ratings for users
spend time learning how joins work, how to get movies of a particular rating range in one query, how to search the movies table ( like, regex) - as mentioned, use explain to see how different things affect speed. Make a day of it; I guarantee your
handle on it will be greatly increased.
If you're still struggling for case-scenarios, start looking here on SO for questions and try out those scenarios yourself.
I don't know if MIT open courseware has anything about databases... Well whaddya know? They do: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-830Fall-2005/CourseHome/
I would recommend that as one source based only on MITs reputation. If you can take a formal course from a university you may find that helpful. Also a good understanding of the fundamental discrete mathematics/logic certainly would do no harm.
As others have said, time and practice is the only real approach.
More practically, I found that EXPLAIN worked wonders for me personally. Learning to read the output of that was probably the biggest single leap I made in being able to write efficient queries.
The second thing I found really helpful was SQL Tuning by Dan Tow, which describes a fairly formal methodology for extracting performance. It's a bit involved, but works well in lots of situations. And if nothing else, it will give you a much better understanding of the way joins are processed.
Start with a class like this one: https://www.udemy.com/sql-mysql-databases/
Then use what you've learned to create and manage a number of SQL databases and run queries. Getting to the expert level is really about practice. But of course you need to learn the pieces before you can practice.

Is premature optimization in SQL as "evil" as it is in procedural programming languages?

I'm learning SQL at the moment and I've read that joins and subqueries can potentially be performance destroyers. I (somewhat) know the theory about algorithmic complexity in procedural programming languages and try to be mindful of that when programming, but I don't know how expensive different SQL queries can be. I'm deciding whether I should invest time in learning about SQL performance or just notice it when my queries run slow. The base question for me then is: is premature optimization for SQL as evil as it is for procedural languages?
As added information, I work in an environment where, most of the time, high performance is not an issue and the biggest tables I have to work with have some 150k rows.
Here's the Donald Knuth quote I refer to when saying "evil":
We should forget about small
efficiencies, say about 97% of the
time: premature optimization is the
root of all evil. Yet we should not
pass up our opportunities in that
critical 3%.
I would say that some general notions about performance are a must-have : it'll prevent you from writing really bad queries that can hurt your application (Even if you don't have millions of rows in your tables).
It'll also help you design your database so it's more officient-oriented : you'll have some ideas about where to put indexes, for instance.
But you shouldn't have performance as a first goal : the first thing is to have an application that works ; and, then, if needed, you'll optimize it (having some performance notions while developping will help you have an application that's easier to optimize, though).
Note I would not say that "having notions about performances" is "premature optimization", as long as you don't just "optimize", but just "write correctly" ; I would rather call it good practice that'll help to write better quality code ;-)
What Knuth means is: it's really, really important to know about SQL optimisation but only when you need to. As you say, "most of the time ... high performance is not an issue."
It's that 3% of times when you do need high performance that it's important to know what rules to break and why.
However, unlike procedural languages, even for 150k rows it can be important to know a little about how your query is processed. For instance free text searching will be very slow compared with searching through exact matches on indexed columns. It's going the final steps into e.g. sharding or full denormalisation where most DBAs and developers draw the line.
I wouldn't say that SQL optimization has as many pitfalls as premature programming optimization. Designing your schema and queries ahead of time with performance in mind can help you avoid some really nasty redesigns later on. That being said, spending a day getting rid of a table scan can be utterly worthless to you in the long run if that query isn't a slow query, can be cached, or is rarely called in a manner that would impact your application.
I personally profile my queries and focus on the worst, and most used queries. Careful design ahead of time cuts out most of the worst.
I would say that you should make the SQL as easily readeble as possible, and only worry about the performance once it hits you.
That said.
Be mindfull of standard things as you develop, such as indexes, sub selects, use of cursors where a standard query would do the job, etc.
It will not hurt to develop the original correctly, and you can optimize the problems later when it is needed.
EDIT
Also remeber that maintainability of your SQL code is very important, and that debugging SQL is slightly more difficult than normal coding.
Knuth says "forget about 97%" but for a typical web app it's in the database IO where 97% of the request time is spent. This is where a little optimization effort can yield greatest results.
If this is the kind of apps you're writing I strongly suggest learning as much of how RDBMSes work as you can afford. Other people give you excellent suggestions, and I'd add that I usually follow this list top-down when deciding how to spent my "optimization budget":
Schema design. Think twelve times
about normalizaton and access
strategies. This will save you many
painful hours later.
Query readability. Related to #1,
sometimes trying to reogranize your
queries gives a better understanding
of how schema should look. Also it'll
help later when you ask for
help.
Avoid subqueries in SELECT list, use
JOINs.
If there are slow queries reach for
Profiler. Check for missing indexes
first And finally, if there are
still slow queries, try to rewrite
it.
Keep in mind also, that database performance very much depends on data distribution and number of simultaneous requests (because of locking). Even though a query completes in 1 sec. on your underpowered netbook it could take 15 seconds on the 8-core server. If possible, check your queries on actual data. If you know that concurrency is going to be high it's (paradoxically) better to use many small queries than one big one.
I agree with everything that's said here, and I'd like to add: make sure that your SQL is well-encapsulated so that, when you discover what needs to be optimized, there's only one place you need to change it, and the change will be transparent to whatever code calls it.
Personally, I like to encapsulate all of my SQL in PL/SQL procedures, but there are some who disagree with that. Whatever you do, I recommend trying to avoid putting your SQL "inline" with other sourcecode. That seems to always lead to cut-and-pasting and quickly becomes hard to maintain. Put your SQL elsewhere, and try to re-use it as much as possible.
Also, read up on indexes, how they really work, and when you should and shouldn't use them. A lot of people's first instinct, when they get a slow query, is to index the table to death. That might solve the problem in the short term, but long-term an over-index table will be slow to insert and update into. A few well-chosen indexes are much better than indexing every field. Try reading "Refactoring SQL Applications" by Stephane Faroult.
Finally, as said above, a properly normalized database design will help avoid 99% of your slow queries. Denormalization is neccesary sometimes, but it's important that you know the rules, before you break them.
Good luck!