Using sqlite for very large merges and basic queries - sql

I'm a new guy to databases, and I'm trying to figure out a good solution for dealing with large datasets. I mostly do statistical analyses using R, so I don't need a database as the backend of web pages or anything. By datasets are generally static - they are just big.
I was trying to do a simple left join of a ~10,000,000 record table on a ~1,400,000 table. The 1.4 m table had unique records. After churning for 3 hours, it quit on me. The query was specified correctly - I ran it limiting the retrievals to 1000 records and it returned exactly as I expected. Eventually, I found a way to split this up into 10 queries and it ran, but by this time, I was able to do that merge in R pretty quickly, without all the fancy calls to sqlite and indexing.
I've been looking to use databases because I thought they were faster/more effective for these basic data manipulations, but maybe I'm just overlooking something. In the above example, I had indexed in the appropriate columns, and I'm surprised that sqlite could not handle it whilst R could.
Sorry if this question is a little foggy (I'm a little foggy on databases), but if anyone has any advice on something obvious I'm doing wrong to not take advantage of the power of sqlite, that would be great. Or am I just expecting to much of of it, and a 100 m X 1.4 m record merge is just too big to execute without breaking it up?
I would think that a database could outperform R in this respect?
thanks!
EXL

I am going through the same process. If you look through the questions I've asked recently, you may get some good pointers, or at least avoid a lot of the time I've wasted :). In short, here's what's been most helpful to me.
-- the RSQLite package
-- the RSQLite.extfuns package
-- the SQLite FAQ
I'm still a newbie, but in general, you should be using SQLite for subsetting data that is too large to bring in to RAM. I would think that if the data are small enough to handle in RAM, then you're better off using the native R tools for joins/subsets. If you find that you become more comfortable with SQL queries, then there is the sqldf package. Also, JD Long has a great discussion on using sqldf with large datasets.

I have to admit that I'm surprised that this has been a problem for you. SQLite has always worked well for me, at least speed-wise. However -- SQLite is easy because it is so flexible. SQLite can be dangerous because it is so flexible. SQLite tends to be very forgiving with data types. Sometimes this is an absolute god-send, when I don't want to spend a bunch of time tweaking things to perfection, but with great flexibility comes great responsibility.
I have noticed that I need to be careful moving data into SQLite. Text is easy. However, sometimes numbers get stored as text rather than numbers. Doing a JOIN on a column of numbers is faster than the same JOIN on a column of text. If your number columns are stored as text and then coerced into numbers for the comparison, you would lose most of the advantage of using an index.
I don't know how you got your data into SQLite, so the first thing I would do is look at your table schemas and make sure they make sense. And while they may seem obvious, indexes can be tricky. Taking a look at the queries might also result in something useful.
Without being able to see the underlying structure and queries, answers to this question will be educated guesses.

Related

BigQuery - how to think about query optimisation when coming from a SQL Server background

I have a background that includes SQL Server and Informix database query optimisation (non big-data). I'm confident in how to maximise database performance on those systems. I've recently been working with BigQuery and big data (about 9+ months), and optimisation doesn't seem to work the same way. I've done some research and read some articles on optimisation, but I still need to better understand the basics of how to optimise on BigQuery.
In SQL Server/Informix, a lot of the time I would introduce a column index to speed up reads. BigQuery doesn't have indexes, so I've mainly been using clustering. When I've done benchmarking after introducing a cluster for a column that I thought should make a difference, I didn't see any significant change. I'm also not seeing a difference when I switch on query cacheing. This could be an unfortunate coincidence with the queries I've tried, or a mistaken perception, however with SQL Server/SQL Lite/Informix I'm used to seeing immediate significant improvement, consistently. Am I misunderstanding clustering (I know it's not exactly like an index, but I'm expecting it should work in a similar type of way), or could it just be that I've somehow been 'unlucky' with the optimisations.
And this is where the real point is. There's almost no such thing as being 'unlucky' with optimisation, but in a traditional RDBMS I would look at the execution plan and know exactly what I need to do to optimise, and find out exactly what's going on. With BigQuery, I can get the 'execution details', but it really isn't telling me much (at least that I can understand) about how to optimise, or how the query really breaks down.
Do I need a significantly different way of thinking about BigQuery? Or does it work in similar ways to an RDBMS, where I can consciously make the first JOINS eliminate as many records as possible, use 'where' clauses that focus on indexed columns, etc. etc.
I feel I haven't got the control to optimise like in a RDBMS, but I'm sure I'm missing a major point (or a few points!). What are the major strategies I should be looking at for BigQuery optimisation, and how can I understand exactly what's going on with queries? If anyone has any links to good documentation that would be fantastic - I'm yet to read something that makes me think "Aha, now I get it!".
It is absolutely a paradigm shift in how you think. You're right: you don't have hardly any control in execution. And you'll eventually come to appreciate that. You do have control over architecture, and that's where a lot of your wins will be. (As others mentioned in comments, the documentation is definitely helpful too.)
I've personally found that premature optimization is one of the biggest issues in BigQuery—often the things you do trying to make a query faster actually have a negative impact, because things like table scans are well optimized and there are internals that you can impact (like restructuring a query in a way that seems more optimal, but forces additional shuffles to disk for parallelization).
Some of the biggest areas our team HAS seem greatly improve performance are as follows:
Use semi-normalized (nested/repeated) schema when possible. By using nested STRUCT/ARRAY types in your schema, you ensure that the data is colocated with the parent record. You can basically think of these as tables within tables. The use of CROSS JOIN UNNEST() takes a little getting used to, but eliminating those joins makes a big difference (especially on large results).
Use partitioning/clustering on large datasets when possible. I know you mention this, just make sure that you're pruning what you can using _PARTITIONTIME when possible, and also using clutering keys that make sense for your data. Keep in mind that clustering basically sorts the storage order of the data, meaning that the optimizer knows it doesn't have to continue scanning if the criteria has been satisfied (so it doesn't help as much on low-cardinality values)
Use analytic window functions when possible. They're very well optimized, and you'll find that BigQuery's implementation is very mature. Often you can eliminate grouping this way, or filter our more of your data earlier in the process. Keep in mind that sometimes filtering data in derived tables or Common Table Expressions (CTEs/named WITH queries) earlier in the process can make a more deeply nested query perform better than trying to do everything in one flat layer.
Keep in mind that results for Views and Common Table Expressions (CTEs/named WITH queries) aren't materialized during execution. If you use the CTE multiple times, it will be executed multiple times. If you join the same View multiple times, it will be executed multiple times. This was hard for members of our team who came from the world of materialized views (although it looks like somethings in the works for that in BQ world since there's an unused materializedView property showing in the API).
Know how the query cache works. Unlike some platforms, the cache only stores the output of the outermost query, not its component parts. Because of this, only an identical query against unmodified tables/views will use the cache—and it will typically only persist for 24 hours. Note that if you use non-deterministic functions like NOW() and a host of other things, the results are non-cacheable. See details under the Limitations and Exceptions sections of the docs.
Materialize your own copies of expensive tables. We do this a lot, and use scheduled queries and scripts (API and CLI) to normalize and save a native table copy of our data. This allows very efficient processing and fast responses from our client dashboards as well as our own reporting queries. It's a pain, but it works well.
Hopefully that will give you some ideas, but also feel free to post queries on SO in the future that you're having a hard time optimizing. Folks around here are pretty helpful when you let them know what your data looks like and what you've already tried.
Good luck!

are performance/code-maintainability concerns surrounding SELECT * on MS SQL still relevant today, with modern ORMs?

summary: I've seen a lot of advice against using SELECT * in MS SQL, due to both performance and maintainability concerns. however, many of these posts are very old - 5 to 10 years! it seems, from many of these posts, that the performance concerns may have actually been quite small, even in their time, and as to the maintainability concerns ("oh no, what if someone changes the columns, and you were getting data by indexing an array! your SELECT * would get you in trouble!"), modern coding practices and ORMs (such as Dapper) seem - at least in my experience - to eliminate such concerns.
and so: are there concerns with SELECT * that are still relevant today?
greater context: I've started working at a place with a lot of old MS code (ASP scripts, and the like), and I've been helping to modernize a lot of it, however: most of my SQL experience is actually from MySQL and PHP frameworks and ORMs - this is my first time working with MS SQL - and I know there are subtle differences between the two. ALSO: my co-workers are a little older than I am, and have some concerns that - to me - seem "older". ("nullable fields are slow! avoid them!") but again: in this particular field, they definitely have more experience than I do.
for this reason, I'd also like to ask: whether SELECT * with modern ORMs is or isn't safe and sane to do today, are there recent online resources which indicate such?
thanks! :)
I will not touch maintainability in this answer, only performance part.
Performance in this context has little to do with ORMs.
It doesn't matter to the server how the query that it is running was generated, whether it was written by hand or generated by the ORM.
It is still a bad idea to select columns that you don't need.
It doesn't really matter from the performance point of view whether the query looks like:
SELECT * FROM Table
or all columns are listed there explicitly, like:
SELECT Col1, Col2, Col3 FROM Table
If you need just Col1, then make sure that you select only Col1. Whether it is achieved by writing the query by hand or by fine-tuning your ORM, it doesn't matter.
Why selecting unnecessary columns is a bad idea:
extra bytes to read from disk
extra bytes to transfer over the network
extra bytes to parse on the client
But, the most important reason is that optimiser may not be able to generate a good plan. For example, if there is a covering index that includes all requested columns, the server will usually read just this index, but if you request more columns, it would do extra lookups or use some other index, or just scan the whole table. The final impact can vary from negligible to seconds vs hours of run time. The larger and more complicated the database, the more likely you see the noticeable difference.
There is a detailed article on this topic Myth: Select * is bad on the Use the index, Luke web-site.
Now that we have established a common understanding of why selecting
everything is bad for performance, you may ask why it is listed as a
myth? It's because many people think the star is the bad thing.
Further they believe they are not committing this crime because their
ORM lists all columns by name anyway. In fact, the crime is to select
all columns without thinking about it—and most ORMs readily commit
this crime on behalf of their users.
I'll add answers to your comments here.
I have no idea how to approach an ORM that doesn't give me an option which fields to select. I personally would try not to use it. In general, ORM adds a layer of abstraction that leaks badly. https://en.wikipedia.org/wiki/Leaky_abstraction
It means that you still need to know how to write SQL code and how DBMS runs this code, but also need to know how ORM works and generates this code. If you choose not to know what's going on behind ORM you'll have unexplainable performance problems when your system grows beyond trivial.
You said that at your previous job you used ORM for a large system without problems. It worked for you. Good. I have a feeling, though, that your database was not really large (did you have billions of rows?) and the nature of the system allowed to hide performance questions behind the cache (it is not always possible). The system may never grow beyond the hardware capacity. If your data fits in cache, usually it will be reasonably fast in any case. It begins to matter only when you cross the certain threshold. After which suddenly everything becomes slow and it is hard to fix it.
It is common for a business/project manager to ignore the possible future problems which may never happen. Business always has more pressing urgent issues to deal with. If business/system grows enough when performance becomes a problem, it will either have accumulated enough resources to refactor the whole system, or it will continue working with increasing inefficiency, or if the system happens to be really critical to the business, just fail and give a chance to another company to overtake it.
Answering your question "whether to use ORMs in applications where performance is a large concern". Of course you can use ORM. But, you may find it more difficult than not using it. With ORM and performance in mind you have to inspect manually the SQL code that ORM generates and make sure that it is a good code from performance point of view. So, you still need to know SQL and specific DBMS that you use very well and you need to know your ORM very well to make sure it generates the code that you want. Why not just write the code that you want directly?
You may think that this situation with ORM vs raw SQL somewhat resembles a highly optimising C++ compiler vs writing your code in assembler manually. Well, it is not. Modern C++ compiler will indeed in most cases generate code that is better than what you can write manually in assembler. But, compiler knows processor very well and the nature of the optimisation task is much simpler than what you have in the database. ORM has no idea about the volume of your data, it knows nothing about your data distribution.
The simple classic example of top-n-per-group can be done in two ways and the best method depends on the data distribution that only the developer knows. If performance is important, even when you write SQL code by hand you have to know how DBMS works and interprets this SQL code and lay out your code in such a way that DBMS accesses the data in an optimal way. SQL itself is a high-level abstraction that may require fine-tuning to get the best performance (for example, there are dozens of query hints in SQL Server). DBMS has some statistics and its optimiser tries to use it, but it is often not enough.
And now on top of this you add another layer of ORM abstraction.
Having said all this, "performance" is a vague term. All these concerns become important after a certain threshold. Since modern hardware is pretty good, this threshold had been pushed rather far to allow a lot of projects to ignore all these concerns.
Example. An optimal query over a table with million rows returns in 10 milliseconds. A non-optimal query returns in 1 second. 100 times slower. Would the end-user notice? Maybe, but likely not critical. Grow the table to billion rows or instead of one user have 1000 concurrent users. 1 second vs 100 seconds. The end-user would definitely notice, even though the ratio (100 times slower) is the same. In practice the ratio would increase as data grows, because various caches would become less and less useful.
From a SQL-Server-Performance-Point-of-view, you should NEVER EVER use select *, because this means to sqlserver to read the complete row from disk or ram. Even if you need all fields, i would suggest to not do select *, because you do not know, who is appending any data to the table that your application does NOT need. For Details see answer of #sandip-patel
From a DBA-perspective: If you give exactly those columnnames you need the dbadmin can better analyse and optimize his databases.
From a ORM-Point-Of-View with changing column-names i would suggest to NOT use select *. You WANT to know, if the table changes. How do you want to give a guarantee for your application to run and give correct results if you do not get errors if the underlying tables change??
Personal Opinion: I really do not work with ORM in Applications needing to perform well...
This question is out some time now, and noone seems to be able to find, what Ben is looking for...
I think this is, because the answer is "it depends".
There just NOT IS THE ONE answer to this.
Examples
As i pointed out before, if a database is not yours, and it may be altered often, you cannot guarantee performance, because with select * the amount of data per row may explode
If you write an application using ITS OWN database, noone alters your DB (hopefully) and you need your columns, so whats wrong with select *
If you build some kind of lazy loading with "main properties" beeing loaded instantly and others beeing loaded later (of same entity), you cannot go with select * because you get all
If you use select * other developers will every time think about "did he think about select *" as they will try to optimize. So you should add enough comments...
If you build 3-Tier-Application building large caches in the middle-Tier and performance is a theme beeing done by cache, you may use select *
Expanding 3Tier: If you have many many concurrent users and/or really big data, you should consider every single byte, because you have to scale up your middle-Tier with every byte beeing wasted (as someone pointed out in the comments before)
If you build a small app for 3 users and some thousands of records, the budget may not give time to optimize speed/db-layout/something
Speak to your dba... HE will advice you WHICH statement has to be changed/optimized/stripped down/...
I could go on. There just is not ONE answer. It just depends on to many factors.
It is generally a better idea to select the column names explicitly. Should a table receive an extra column it would be loaded with a select * call, where the extra column is not needed.
This can have several implications:
More network traffic
More I/O (got to read more data from disk)
Possibly even more I/O (a covering index cannot be used - a table scan is performed to get the data)
Possibly even more CPU (a covering index cannot be used so data needs sorting)
EXCEPTION. The only place where Select * is OK, is in the sub-query after an Exists or Not Exists predicate clause, as in:
Select colA, colB
From table1 t1
Where Exists (Select * From Table2 Where column = t1.colA)
More Details -1
More Details -2
More Details -3
Maintainability point.
If you do a "Select * from Table"
Then I alter the Table and add a column.
Your old code will likely crash as it now has an additional column in it.
This creates a night mare for future revisions because you have to identify all the locations for the select *.
The speed differences is so minimal I would not be concerned about it. There is a speed difference in using Varchar vs Char, Char is faster. But the speed difference is so minimal it is just about not worth talking about.
Select *'s biggest issue is with changes (additions) to the table structure.
Maintainability nightmare. Sign of a Junior programmer, and poor project code. That being said I still use select * but intend to remove it before I go to production with my code.

For really complex reports, do people sometimes code in their language rather than in sql?

I have some pretty complex reports to write. Some of them... I'm not sure how I could write an sql query for just one of the values, let alone stuff them in a single query.
Is it common to just pull a crap load of data and figure it all via code instead? Or should I try and find a way to make all the reports rely on sql?
I have a very rich domain model. In fact, parts of code can be expanded on to calculate exactly what they want. The actual logic is not all that difficult to write - and it's nicer to work my domain model than with SQL. With SQL, writing the business logic, refactoring it, testing it and putting it version control is a royal pain because it's separate from your actual code.
For example, one the statistics they want is the % of how much they improved, especially in relation to other people in the same class, the same school, and compared to other schools. This requires some pretty detailed analysis of how they performed in the past to their latest information, as well as doing a calculation for the groups you are comparing against as a whole. I can't even imagine what the sql query would even look like.
The thing is, this % improvement is not a column in the database - it involves a big calculation in of itself by analyzing all the live data in real-time. There is no way to cache this data in a column as doing this calculation for every row it's needed every time the student does something is CRAZY.
I'm a little afraid about pulling out hundreds upon hundreds of records to get these numbers though. I may have to pull out that many just to figure out 1 value for 1 user... and if they want a report for all the users on a single screen, it's going to basically take analyzing the entire database. And that's just 1 column of values of many columns that they want on the report!
Basically, the report they want is a massive performance hog no matter what method I choose to write it.
Anyway, I'd like to ask you what kind of solutions you've used to these kind of a problems.
Sometimes a report can be generated by a single query. Sometimes some procedural code has to be written. And sometimes, even though a single query CAN be used, it's much better/faster/clearer to write a bit of procedural code.
Case in point - another developer at work wrote a report that used a single query. That query was amazing - turned a table sideways, did some amazing summation stuff - and may well have piped the output through hyperspace - truly a work of art. I couldn't have even conceived of doing something like that and learned a lot just from readying through it. It's only problem was that it took 45 minutes to run and brought the system to its knees in the process. I loved that query...but in the end...I admit it - I killed it. ((sob!)) I dismembered it with a chainsaw while humming "Highway To Hell"! I...I wrote a little procedural code to cover my tracks and...nobody noticed. I'd like to say I was sorry, but...in the end the job ran in 30 seconds. Oh, sure, it's easy enough to say "But performance matters, y'know"...but...I loved that query... ((sniffle...)) Anybody seen my chainsaw..? >;->
The point of the above is "Make Things As Simple As You Can, But No Simpler". If you find yourself with a query that covers three pages (I loved that query, but...) maybe it's trying to tell you something. A much simpler query and some procedural code may take up about the same space, page-wise, but could possibly be much easier to understand and maintain.
Share and enjoy.
Sounds like a challenging task you have ahead of you. I don't know all the details, but I think I would go at it from several directions:
Prioritize: You should try to negotiate with the "customer" and prioritize functionality. Chances are not everything is equally useful for them.
Manage expectations: If they have unrealistic expectations then tell them so in a nice way.
IMHO SQL is good in many respects, but it's not a brilliant programming language. So I'd rather just do calculations in the application rather than in the database.
I think I'd go for some delay in the system .. perhaps by caching calculated results for some minutes before recalculating. This is with a mind towards performance.
The short answer: for analysing large quantities of data, a SQL database is probably the best tool around.
However, that does not mean you should analyse this straight off your production database. I suggest you look into Datawarehousing.
For a one-off report, I'll write the code to produce it in whatever I can best reason about it in.
For a report that'll be generated more than once, I'll check on who is going to be producing it the next time. I'll still write the code in whatever I can best reason about it in, but I might add something to make it more attractive to use to that other person.
People usually use a third party report writing system rather than writing SQL. As an application developer, if you're spending a lot of time writing complex reports, I would severely question your manager's actions in NOT buying an off-the-shelf solution and letting less-skilled people build their own reports using some GUI.

Fastest way to become a MySQL expert?

I have been using MySQL for years, mainly on smaller projects until the last year or so. I'm not sure if it's the nature of the language or my lack of real tutorials that gives me the feeling of being unsure if what I'm writing is the proper way for optimization purposes and scaling purposes.
While self-taught in PHP I'm very sure of myself and the code I write, easily can compare it to others and so on.
With MySQL, I'm not sure whether (and in what cases) an INNER JOIN or LEFT JOIN should be used, nor am I aware of the large amount of functionality that it has. While I've written code for databases that handled tens of millions of records, I don't know if it's optimum. I often find that a small tweak will make a query take less than 1/10 of the original time... but how do I know that my current query isn't also slow?
I would like to become completely confident in this field in the ability to optimize databases and be scalable. Use is not a problem -- I use it on a daily basis in a number of different ways.
So, the question is, what's the path? Reading a book? Website/tutorials? Recommendations?
EXPLAIN is your friend for one. If you learn to use this tool, you should be able to optimize your queries very effectively.
Scan the the MySQL manual and read Paul DuBois' MySQL book.
Use EXPLAIN SELECT, SHOW VARIABLES, SHOW STATUS and SHOW PROCESSLIST.
Learn how the query optimizer works.
Optimize your table formats.
Maintain your tables (myisamchk, CHECK TABLE, OPTIMIZE TABLE).
Use MySQL extensions to get things done faster.
Write a MySQL UDF function if you notice that you would need some
function in many places.
Don't use GRANT on table level or column level if you don't really need
it.
http://dev.mysql.com/tech-resources/presentations/presentation-oscon2000-20000719/index.html
The only way to become an expert in something is experience and that usually takes time. And a good mentor(s) that are better than you to teach you what you are missing. The problem is you don't know what you don't know.
Research and experience - if you don't have the projects to warrant the research, make them. Make three tables with related data and make up scenarios.
E.g.
Make a table of movies their data
make a table of user
make a table of ratings for users
spend time learning how joins work, how to get movies of a particular rating range in one query, how to search the movies table ( like, regex) - as mentioned, use explain to see how different things affect speed. Make a day of it; I guarantee your
handle on it will be greatly increased.
If you're still struggling for case-scenarios, start looking here on SO for questions and try out those scenarios yourself.
I don't know if MIT open courseware has anything about databases... Well whaddya know? They do: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-830Fall-2005/CourseHome/
I would recommend that as one source based only on MITs reputation. If you can take a formal course from a university you may find that helpful. Also a good understanding of the fundamental discrete mathematics/logic certainly would do no harm.
As others have said, time and practice is the only real approach.
More practically, I found that EXPLAIN worked wonders for me personally. Learning to read the output of that was probably the biggest single leap I made in being able to write efficient queries.
The second thing I found really helpful was SQL Tuning by Dan Tow, which describes a fairly formal methodology for extracting performance. It's a bit involved, but works well in lots of situations. And if nothing else, it will give you a much better understanding of the way joins are processed.
Start with a class like this one: https://www.udemy.com/sql-mysql-databases/
Then use what you've learned to create and manage a number of SQL databases and run queries. Getting to the expert level is really about practice. But of course you need to learn the pieces before you can practice.

is there such thing as a query being too big?

I don't have too much experience with SQL. Most of the queries I have written have been very small. Whenever I see a very large query, I always kinda assume it needs to be optimized. But is this true? or is there situations where really large queries are just whats needed?
BTW when I say large queries I mean queries that exceed 1000+ chars
Yes, any statement, method, or even query can be "too big".
The problem, is actually defining what too big really is.
If you can't sit down and figure out what the query does in a relatively short amount of time, it's probably best to break it up into smaller chunks.
I always like to look at things from a maintenance standpoint. If the query is hard to understand now, what if you have to debug something in it?
Just because you see a large query, doesn't mean it needs to be changed or optimized, but if it's too complicated for its own good, then you might want to consider refactoring.
Just as in other languages, you can't determine the efficiency of a query based on a character count. Also, 1000 characters isn't what I could call "large", especially when you use good table/column names, aliases that make sense, etc.
If you're not comfortable enough with SQL to be able to "eye ball" the design merits of particular query, run it through a profiler and examine the execution plan. That'll give you a good idea of problems, if any, the code in question will suffer from.
My rule of thumb is this: write the best, tightest, simplest code you can, and optimize where needed - ie, where you see a performance bottleneck or where (as frequently happens) you slap yourself in the head and say "D'OH!" at about three in the morning on vacation.
Summary:Code well, and optimize where needed.
As Robert said, if you can't easily tell what the query is doing, it probably needs to be simplified.
If you are used to writing simple stuff, you may not realize how complex getting information for a complex report might be. Yes, queries can get long and complicated and still perform well for what they are being asked to do. Often the techniques that are used to performance tune something may make the code look more complicated to those less familar with advanced querying techniques. What counts is how long it takes to execute and whether it returns the correct data, not how many characters it has.
When I see a complex query, my first thought is does it return what the developer really intended to return (you'd be surprised at how often the answer to that is no) and then I look to see if it could be performance tuned. Yes there are many badly written long queries out there, but there are also as many or more that do what they are intended to do about as fast as it can be done without a major database redesign or faster hardware.
I'd suggest that it's not the characters that should measure the size/complexity of the query.
I'd boil it down to:
what's the goal of the query?
does it used set-based logic?
does it re-use any components?
does it JOIN improperly/poorly?
what are the performance implications?
maintainability concerns - is it written so that another developer can grok its intentions?
Where I work we've created stored procedures that exceed 1000 characters. I can't really say it was NECESSARY but sometimes haste wins out over efficiency (most notably when a quick fix is necessary for a client).
Having said that ... if given the time I would attempt to optimize a query as small/efficient as it can get without it being overly confusing. I've used nested stored procedures to make things a little more clear and/or functions as well.
The number of characters does not mean that a query needs to be optimized - it is what you see within those characters that does.
Things like subqueries on top of subqueries is something I would review. I'd review JOINs as well, but it shouldn't take long comparing to the ERD to know if there's an unnecessary JOIN - the first thing I'd look at would be what tables are joined but not used in the output, which is fine if the tables are link/corrollary/etc tables.