SQL Azure - Max row size exceeded error at Standard 50 DTU but same query has no error at Premium 150 DTU

I'm interested if anyone knows what could be going on under the hood to cause the same query to fail at one Azure SQL performance level but work at a higher one.
The query does max out the server to 100% at the Standard level, but I would expect an out-of-memory-related exception if that were the issue. Instead I get:
Cannot create a row of size 8075 which is greater than the allowable maximum row size of 8060
I am aware that the query needs to be optimized, but what I am interested in for the purposes of this question is: what about bumping up to Premium 150 DTU would suddenly make the same data not exceed the max row size?

I can make an educated guess as to what your problem is. When you change from one reservation size to another, the resources available to the optimizer change; specifically, it believes it has more memory. Memory is a key component in costing queries in the query optimizer. The plan chosen at the lower reservation size likely has a spool or sort that is trying to create an object in tempdb, while the plan at the higher reservation size does not. That spool or sort is hitting a limitation in the storage engine, since the intermediate table cannot be materialized when its row exceeds the 8060-byte limit.
Without looking at the plan, it is not possible to say with certainty whether this is a requirement of the chosen query plan or merely an optimization. However, you can try using the NO_PERFORMANCE_SPOOL hint on the query to see if that makes it work on the smaller reservation size. (Given that it has less memory, however, my guess is that this is not the issue.)
https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-2017
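As a minimal sketch of where the hint goes (the query shape and names below are placeholders, not your actual query), it is attached in an OPTION clause at the end of the statement:

    SELECT t1.Col1, t2.Col2
    FROM dbo.Table1 AS t1
    JOIN dbo.Table2 AS t2
      ON t2.Table1Id = t1.Id
    OPTION (NO_PERFORMANCE_SPOOL);  -- available in SQL Server 2016+ and Azure SQL Database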
(Now I am guessing with general advice, since I don't know what kind of app you have, but it is based on the normal patterns I see regularly.)
If your schema is really wide or poorly defined, please consider revising your table definition to reduce the size of the columns to the right minimum. For data warehousing applications, please consider using dimension tables + surrogate keys. If you are dumping text log files into SQL and then trying to DISTINCT them, note that DISTINCT often implies a sort, which could lead to this kind of issue if the row is too wide (as you are trying to use all columns in the key).
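If it helps, here is a rough sketch (catalog views only, no assumptions about your schema) for spotting tables whose declared row size approaches the 8060-byte limit; the 7000-byte threshold is arbitrary:

    SELECT OBJECT_SCHEMA_NAME(c.object_id) AS schema_name,
           OBJECT_NAME(c.object_id)        AS table_name,
           SUM(c.max_length)               AS declared_max_row_bytes  -- max_length is -1 for (n)varchar(max)/varbinary(max)
    FROM sys.columns AS c
    JOIN sys.tables  AS t ON t.object_id = c.object_id
    GROUP BY c.object_id
    HAVING SUM(c.max_length) > 7000
    ORDER BY declared_max_row_bytes DESC;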
Best of luck getting your app to work. SQL tends to work very well for relational apps where you think through the details of your schema and indexes a bit. In the future, please post a bit more detail about your query patterns, schema, and query plans so others can help you more precisely.

Related

Is a Data-filled SQL table queryable while setting up a new index?

Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but its effect on performance is nearly impossible for us to predict. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDBMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, provided your system has the IO/CPU capacity to handle both tasks at the same time. Inserting, however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.
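For SQL Server specifically, one way to keep the table fully available while the index is built is an online index build (Enterprise edition or Azure SQL Database); a minimal sketch with hypothetical names:

    -- Builds the index while allowing concurrent reads and writes on dbo.Orders
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
        ON dbo.Orders (CustomerId)
        WITH (ONLINE = ON);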

Dynamic Index Creation - SQL Server

OK, so I work for a company that sells a web product which has an MS SQL Server back end (can be any version; we've just changed our requirements to 2008+ now that 2005 is out of extended support). All databases are owned by the company who purchases the product, but we have VPN access and a tech support department to deal with any issues. One part of my role is to act as 3rd line support for SQL issues.
When performance is a concern, one of the usual checks is unused/missing indexes. We've got the usual standard indexes, but depending on which modules a company uses and how it utilises the system, it will require different indexes (there's an accounting module and a document management module, amongst others). With hundreds of customers it's not possible to remote onto each one on a regular basis to carry out optimisation work. I'm wondering if anybody else in my position has considered a scheduled task that could drop and create indexes when needed?
I've got concerns (obviously); any changes that this procedure makes would also be stored in a table with full details of the change and a timestamp. I'd need this to be bulletproof; I can't be sending something out into the wild if it may cause issues. I'm thinking of an overnight or (probably) weekly task.
Dropping Indexes:
Would require the server to be up for a minimum amount of time to ensure all relevant server statistics are up to date (say 2 weeks or 1 month).
Only drop unused indexes for tables that are being actively used (indexes on unused parts of the system aren't a concern); see the rough sketch after this list.
Log it.
This won't highlight duplicate indexes (that will have to be manual), just the quick wins (unused indexes with writes).
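For the unused-index side, something like the following is the sort of check I have in mind (a rough sketch only; the thresholds would need tuning, and it only covers usage since the last restart):

    -- Non-clustered indexes with writes but no reads in the current database
    SELECT OBJECT_NAME(i.object_id) AS table_name,
           i.name                   AS index_name,
           us.user_updates,
           us.user_seeks + us.user_scans + us.user_lookups AS user_reads
    FROM sys.indexes AS i
    JOIN sys.dm_db_index_usage_stats AS us
      ON us.object_id   = i.object_id
     AND us.index_id    = i.index_id
     AND us.database_id = DB_ID()
    WHERE i.index_id > 1                    -- skip heaps and clustered indexes
      AND i.is_primary_key = 0
      AND i.is_unique_constraint = 0
      AND us.user_seeks + us.user_scans + us.user_lookups = 0
      AND us.user_updates > 0
    ORDER BY us.user_updates DESC;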
Creating Indexes:
Only look for indexes with a value above a certain threshold.
Would have to check whether any similar indexes could be modified to cover the requirement. This could be on a ranking (check all indexed fields are the same and then score the included fields to see if additional ones would be needed).
Limit the number of indexes to be created (say 5 per week) to ensure it doesn't get carried away and create a bunch at once. This should help it focus only on the most important indexes.
Log it.
This would need to be dynamic as we've got customers on different versions of the system with different usage patterns.
Just to clarify: I'm not expecting anybody to code for this, it's more a question relating to the feasibility and concerns for a task like this.
Edit: I've put a bounty on this to gather some further opinions and to get feedback from anybody who may have tried this before. I'll award it to the answer with the most upvotes by the time the bounty duration ends.
I can't recommend what you're contemplating, but you might be able to simplify your life by gathering the inputs to your contemplated program and making them available to clients and the support team.
If the problem were as simple as you suppose, surely the server itself or the tuning advisor would have solved it by now. You're making at least one unwarranted assumption,
require the server to be up for a minimum amount of time to ensure all relevant server statistics are up to date.
Table statistics are only as good as the last time they were updated after a significant change. Uptime won't guarantee anything about a truncate table or a bulk insert.
This won't highlight duplicate indexes
But that's something you can do in a single query using the system tables. (It would be disappointing if the tuning gadget didn't help with those.) You could similarly look for overlapping indexes, such as for columns {a,b} and {a}; the second won't be useful unless {b} is selective and there are queries that don't mention {b}.
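For example, a rough sketch of such a query against the system tables (it compares key columns only and ignores included columns):

    WITH keys AS (
        SELECT i.object_id, i.index_id, i.name,
               (SELECT QUOTENAME(c.name) + N';'
                FROM sys.index_columns AS ic
                JOIN sys.columns AS c
                  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
                WHERE ic.object_id = i.object_id
                  AND ic.index_id  = i.index_id
                  AND ic.is_included_column = 0
                ORDER BY ic.key_ordinal
                FOR XML PATH('')) AS key_cols
        FROM sys.indexes AS i
        WHERE i.index_id > 0              -- ignore heaps
    )
    SELECT OBJECT_NAME(a.object_id) AS table_name,
           a.name AS index_a,
           b.name AS index_b,
           a.key_cols
    FROM keys AS a
    JOIN keys AS b
      ON a.object_id = b.object_id
     AND a.index_id  < b.index_id
     AND a.key_cols  = b.key_cols;        -- identical key column lists = duplicates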
To look for new indexes, I would be tempted to try to instrument query use frequency and automate the analysis of query plan output. If you can identify frequently used, long-running queries and map their physical operations (table scan, hash join, etc.) onto the tables and existing indexes, you would have good input for adding and removing indexes. But you have to allow for the infrequently run quarterly report that, without its otherwise unused index, would take days to complete.
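As a starting point for that kind of instrumentation, a sketch against the plan-cache DMVs (the numbers reset when the cache is cleared or the server restarts, so they only approximate true frequency):

    -- Most expensive cached query batches by total elapsed time
    SELECT TOP (20)
           qs.execution_count,
           qs.total_elapsed_time / 1000 AS total_elapsed_ms,
           qs.total_logical_reads,
           st.text AS batch_text          -- the statement offsets in dm_exec_query_stats can isolate the exact statement
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_elapsed_time DESC;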
I must tell you that when I did that kind of analysis once some years ago, I was disappointed to learn that most problem children were awful queries, usually prompted by awful table design. No index will help the SQL mule. Hopefully that will not be your experience.
An aspect you didn't touch on that might be just as important is machine capacity. You might look into gathering, say, hourly snapshots of SQL Server stats, like disk queue depth and paging. Hardly a server exists that can't be improved with more RAM, and sometimes that's really the best answer.
The SQL performance tuning advisor is worth a check: https://msdn.microsoft.com/en-us/library/ms186232.aspx
Another way could be to gather performance data (start here: https://www.experts-exchange.com/articles/17780/Monitoring-table-level-activity-in-a-SQL-Server-database-by-using-T-SQL.html) and generate indexes based on the performance table data.
Check this too: https://msdn.microsoft.com/en-us/library/dn817826.aspx

Is Access 'select where' query performance affected by number of records

How much of an impact does the number of records in a table have on the performance of a query that returns approx. 5k records using a 'date is greater than' criterion?
For example, where a table might have a total of 50k records vs. 100k.
In general, yes, more records will take longer to work through. However, it's impossible for us to give you a specific number, as there are a lot of factors involved, and every scenario would be different.
Specifically, it could depend on whether the database is hosted locally or on a server. It would depend on indexing, hardware (disks, specifically), possibly network speed, etc., etc.
If you're pulling twice as much data, theoretically it should take about double the time, but again, YMMV. With proper indexing, it might be faster.
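With Access specifically, the first thing to check is whether the date column used in the criteria is indexed; a minimal sketch with hypothetical table/column names (Access date literals use the # delimiters):

    CREATE INDEX idxOrdersOrderDate ON Orders (OrderDate);

    SELECT *
    FROM Orders
    WHERE OrderDate > #2019-01-01#;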
Incidentally, if speed (and perhaps more importantly, recovery, security, storage size and multi-user usage) might be an issue, you're probably better off switching to SQL Server.

Would this method work to scale out SQL queries?

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (horizontally), with one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum on the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model? I believe one issue would be the COUNT(DISTINCT ...) function.
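To illustrate the average rewrite (a sketch only; the table and intermediate names are made up): each node returns a partial sum and count, and the coordinator combines them.

    -- The incoming AVG(amount) is rewritten and sent to every node (each node holds a subset of the rows)
    SELECT SUM(amount) AS partial_sum, COUNT(amount) AS partial_count
    FROM measurements;

    -- On the coordinator, over the union of the per-node result rows
    SELECT SUM(partial_sum) * 1.0 / SUM(partial_count) AS overall_avg  -- * 1.0 avoids integer division
    FROM node_results;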
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your database is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset of the data, then the simple selection query will scale to the network (i.e. you will eventually become network bound to the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
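As a sketch of that kind of roll-up (the table and column names are invented, and the date-truncation syntax shown is SQL Server flavored):

    -- One row per site per hour instead of one row per hit
    INSERT INTO hits_hourly (site_id, hit_hour, hit_count)
    SELECT site_id,
           DATEADD(HOUR, DATEDIFF(HOUR, 0, hit_time), 0) AS hit_hour,  -- truncate to the hour
           COUNT(*)
    FROM web_hits
    GROUP BY site_id, DATEADD(HOUR, DATEDIFF(HOUR, 0, hit_time), 0);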
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running Analysis Services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only solution, but did you first try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with the cube is that, once built, it can be queried in lots of ways that you may not have considered. And as @HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan, then your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau; will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with PK/FK constraints. If this were my world, I would probably create many indexed views on top of my table (and other views), optimized for the particular queries I need to execute (an approach I have used successfully on 10 million+ row tables).
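As a sketch of what such an indexed view might look like (table and column names are invented; SQL Server requires SCHEMABINDING, two-part names and COUNT_BIG(*), and the SUM column is assumed NOT NULL):

    CREATE VIEW dbo.vSalesByDay
    WITH SCHEMABINDING
    AS
    SELECT SaleDate,
           ProductId,
           SUM(Amount)  AS TotalAmount,   -- Amount assumed NOT NULL
           COUNT_BIG(*) AS RowCnt         -- required when the view uses GROUP BY
    FROM dbo.Sales
    GROUP BY SaleDate, ProductId;
    GO
    -- Materializes the view; queries can then read the small aggregate instead of the big table
    CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDay ON dbo.vSalesByDay (SaleDate, ProductId);

On Enterprise edition the optimizer can pick the view up automatically; on other editions the query has to reference the view with a NOEXPAND hint.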
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.

Is there any performance reason to use powers of two for field sizes in my database?

A long time ago, when I was a young lad, I used to do a lot of assembler and optimization programming. Today I mainly find myself building web apps (it's alright too...). However, whenever I create fields for database tables I find myself using values like 16, 32 & 128 for text fields, and I try to combine boolean values into SET data fields.
Is giving a text field a length of 9 going to make my database slower in the long run, and do I actually help it by specifying a field length that is more easily memory-aligned?
Database optimization is quite unlike machine code optimization. With databases, most of the time you want to reduce disk I/O, and wastefully trying to align fields will only make fewer records fit in a disk block/page. Also, if any alignment is beneficial, the database engine will do it for you automatically.
What will matter most is indexes and how well you use them. Trying tricks to pack more information in less space can easily end up making it harder to have good indexes. (Do not overdo it, however; not only do indexes slow down INSERTs and UPDATEs to indexed columns, they also mean more work for the planner, which has to consider all the possibilities.)
Most databases have an EXPLAIN command; try using it on your selects (in particular, the ones with more than one table) to get a feel for how the database engine will do its work.
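For instance, a minimal sketch (the table and column names are made up, and the exact output varies by engine; this is the MySQL/PostgreSQL style):

    EXPLAIN
    SELECT order_id, total
    FROM orders
    WHERE customer_id = 42;

The plan output shows whether an index on customer_id is used or the whole table is scanned, which matters far more than how the columns are sized.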
The size of the field itself may be important, but usually for text, if you use nvarchar or varchar, it is not a big deal, since the DB will only store what you use. The following will have a greater impact on your SQL speed:
Don't have more columns than you need. A bigger table in terms of columns means the database will be less likely to find the results for your queries on the same disk page. Notice that this is true even if you only ask for 2 out of 10 columns in your SELECT... (there is one way to battle this, with clustered indexes, but that can only address one limited scenario).
You should give more details on the type of design issues/alternatives you are considering to get additional tips.
Something that is implied above, but which can stand being made explicit. You don't have any way of knowing what the computer is actually doing. It's not like the old days when you could look at the assembler and know pretty well what steps the program is going to take. A value that "looks" like it's in a CPU register may actually have to be fetched from a cache on the chip or even from the disk. If you are not writing assembler but using an optimizing compiler, or even more surely, bytecode on a runtime engine (Java, C#), abandon hope. Or abandon worry, which is the better idea.
It's probably going to take thousands, maybe tens of thousands of machine cycles to write or retrieve that DB value. Don't worry about the 10 additional cycles due to full word alignments.