SQL Azure Federations and Indexes - Clarification on Performance

We currently have a federated SQL Azure database split across 10 shards in roughly equal portions of data, federated on a Client ID.
At the moment we are experiencing performance problems executing filtered queries; for example, running a query for a specific Client can take over 3 minutes to return 4,000 rows on some shards, whereas running exactly the same query over an unfiltered connection on the same shard returns within a timely 4 seconds. The one noticeable aspect is that the shards experiencing the slowdown tend to contain more Clients, albeit with less data each. The most likely performance inhibitor (I believe) is indexing, and something that ties into the filtered / unfiltered connection.
Having had a search around, I haven't found much information on query performance across shards or on specific indexing strategies for shards (apart from the fact that Azure apparently doesn't support indexed views). My impression (and hence the need for clarification) is that indexes are applied to all members of a federation and not on a member-by-member basis.
If that is the case then we're in a bit of a pickle; apart from re-splitting this particular shard, which doesn't make sense considering the only difference is the number of clients, not the size of the data, our options are limited. A couple of things we're about to try are explicitly adding the filter column to the indexes, or even adding the filter to each query. Safe to say, we're not happy about moving away from a filtered connection.
Has anyone else experienced this problem, or could anyone provide some direction on why an unfiltered connection significantly outperforms a filtered connection?
Thanks in advance...

Indexes in federations are applied on a Federation Member-by-Member basis. If you started with a single indexed member and performed a SPLIT operation, then the indexes are automatically applied to the products of the SPLIT. But if you have applied indexes after multiple members were created, you need to explicitly add indexes to each member.
So hopefully you are not in a pickle.
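For reference, here is a rough sketch of what applying an index to a single member might look like over a member-routed connection (the federation, distribution key, table and index names below are invented placeholders):

    -- Route this connection to the member containing ClientID = 42; FILTERING = OFF
    -- gives an unfiltered, member-wide connection so the DDL applies to the whole member.
    USE FEDERATION ClientFederation (ClientID = 42) WITH RESET, FILTERING = OFF
    GO

    -- Create the index on this member only; repeat (or script a loop over the
    -- members' range boundaries) for each member created before the index existed.
    CREATE INDEX IX_Orders_ClientID
        ON dbo.Orders (ClientID)
        INCLUDE (OrderDate, Amount)
    GO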
You probably want to consider alternatives to federations moving forward since the feature is not supported by the new SKUs announced in April.

Related

How to avoid slow query if data is too big for one partition in NoSQL?

I am learning Cassandra. I am thinking about the problems of SQL that NoSQL addresses, and I have a question about the case of very big data.
Regarding SQL and very big data, many pages say that tables end up spread across different servers and queries become slow because of joins between tables on different servers; this is a problem of SQL that NoSQL addresses. But even with NoSQL, if partitions get too big, don't I need to change my data model, make smaller partitions, and run multiple queries against them to get the same result? And isn't that slow too? Or do you simply never run out of space in a partition, because 2 billion cells is big enough?
I think your question is mixing several different issues.
First of all, the problem with big data and SQL is usually not that queries become slow, but that the solution cannot scale as the data grows bigger and bigger. If you choose to manually split your tables to several servers, as you suggested, what do you do when you need even more servers - redesign your data model? Also, how do you ensure consistency when an update requires modifying several tables but they are on different hosts?
Second, you mentioned joins, and this is something which NoSQL solutions like Cassandra do not support. You need to manually denormalize the data yourself (i.e., put the already joined data in a table). For some things, Cassandra's new "Materialized Views" feature can come in handy.
Third, and perhaps most importantly, you asked about huge partitions. Indeed Cassandra is not designed to handle huge partitions, and the best practice is far below the 2-billion hard limit which you mentioned: Datastax (the commercial company behind Cassandra's development) suggests in https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html that a good rule of thumb is to "keep the maximum number of rows below 100,000 items and the disk size under 100 MB.".
There are several reasons why huge partitions are ill-advised in Cassandra. One of them is that the disk format (sstables and their so-called "promoted index") makes it inefficient to jump to the middle of a huge partition, and you need to do this when you want to read a specific row or iterate through all the rows. Some operations such as compaction and repair work on entire partitions and can become very slow (and in the worst case, also use a lot of memory). E.g., consider a case where a billion-row partition differs on two nodes by just one row: the partition-based repair needs to send the entire partition over the network.
Scylla (https://en.wikipedia.org/wiki/Scylla_(database)), a Cassandra clone which is generally more efficient than Apache Cassandra, also has similar issues with huge partitions (as in Cassandra, moderately large partitions are fine), but these issues are actively being worked on, including re-designing the file format, so eventually Scylla should support arbitrary-sized partitions. However, we're still not there yet, and today the recommendation of not letting partitions grow too huge still applies to Scylla as well.
Finally, if you want to get around the problem of too many rows in a single partition, then yes, you need to tweak your data model to avoid these huge partitions. Sometimes you just need to fix design mistakes in your model - e.g., I have seen people sticking a lot of unrelated data into the same partition, when it could have easily (and more efficiently!) been put in separate partitions. Sometimes you need to artificially split your partitions. This is common in so-called "time-series data" modeling in Cassandra, where we (for example) get a new value of some measurement every second and add it as a row to a partition. Here, instead of having one huge partition for all data ever, the accepted practice is to create a separate partition per time window (e.g., a new partition every day, or week, or whatever). Since most queries involve just one time window anyway, they don't even become slower.
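To make the time-window idea concrete, here is a minimal CQL sketch (table and column names are invented); the day column in the partition key caps each partition at one sensor-day instead of one ever-growing partition per sensor:

    CREATE TABLE measurements (
        sensor_id text,
        day       date,        -- bucket column: one partition per sensor per day
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

    -- A query for one time window touches a single, bounded partition:
    SELECT ts, value FROM measurements
    WHERE sensor_id = 'sensor-1' AND day = '2017-03-01';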

Dynamic Index Creation - SQL Server

OK, so I work for a company that sells a web product with an MS SQL Server back end (it can be any version; we've just changed our requirements to 2008+ now that 2005 is out of extended support). All databases are owned by the company that purchases the product, but we have VPN access and a tech support department to deal with any issues. One part of my role is to act as 3rd line support for SQL issues.
When performance is a concern, one of the usual checks is unused/missing indexes. We've got the usual standard indexes, but depending on which modules a company uses and how it utilises the system, it will require different indexes (there's an accounting module and a document management module, amongst others). With hundreds of customers it's not possible to remote onto each one on a regular basis in order to carry out optimisation work. I'm wondering if anybody else in my position has considered a scheduled task that may be able to drop and create indexes when needed?
I've got concerns (obviously); any changes that this procedure makes would also be stored in a table with full details of the change and a timestamp. I'd need this to be bullet-proof - I can't be sending something out into the wild if it may cause issues. I'm thinking an overnight or (probably) weekly task.
Dropping Indexes:
Would require the server to be up for a minimum amount of time to ensure all relevant server statistics are up to date (say 2 weeks or 1 month).
Only drop unused indexes for tables that are being actively used (indexes on unused parts of the system aren't a concern) - see the sketch after this list.
Log it.
This won't highlight duplicate indexes (that will have to be manual), just the quick wins (unused indexes that still receive writes).
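A minimal sketch of the kind of unused-index check I have in mind, based on the index usage DMVs (the thresholds, and the decision to actually drop anything, would of course need much more care):

    -- Nonclustered indexes with writes but no reads since the last restart.
    SELECT OBJECT_NAME(i.object_id) AS table_name,
           i.name                   AS index_name,
           s.user_updates           AS writes,
           s.user_seeks + s.user_scans + s.user_lookups AS reads
    FROM sys.indexes AS i
    JOIN sys.dm_db_index_usage_stats AS s
         ON s.object_id   = i.object_id
        AND s.index_id    = i.index_id
        AND s.database_id = DB_ID()
    WHERE i.index_id > 1                 -- nonclustered only
      AND i.is_primary_key = 0
      AND i.is_unique_constraint = 0
      AND s.user_seeks + s.user_scans + s.user_lookups = 0
      AND s.user_updates > 0
    ORDER BY s.user_updates DESC;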
Creating Indexes:
Only look for suggested indexes with an estimated value above a certain threshold (see the sketch after this list).
Would have to check whether any similar indexes could be modified to cover the requirement. This could be on a ranking (check all indexed fields are the same and then score the included fields to see if additional would be needed).
Limit the maximum number of indexes to be created (say 5 per week) to ensure it doesn't get carried away and create a bunch at once. This should help focus on only the most important indexes.
Log it.
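And a similarly rough sketch for the creation side, ranking the missing-index suggestions SQL Server already records by an estimated value (the TOP (5) cap mirrors the weekly limit above; the value formula is just the commonly used cost x impact x seeks heuristic):

    SELECT TOP (5)
           d.statement            AS table_name,
           d.equality_columns,
           d.inequality_columns,
           d.included_columns,
           gs.avg_total_user_cost * gs.avg_user_impact
               * (gs.user_seeks + gs.user_scans) AS estimated_value
    FROM sys.dm_db_missing_index_group_stats AS gs
    JOIN sys.dm_db_missing_index_groups  AS g ON g.index_group_handle = gs.group_handle
    JOIN sys.dm_db_missing_index_details AS d ON d.index_handle = g.index_handle
    WHERE d.database_id = DB_ID()
    ORDER BY estimated_value DESC;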
This would need to be dynamic as we've got customers on different versions of the system with different usage patterns.
Just to clarify: I'm not expecting anybody to code for this, it's more a question relating to the feasibility and concerns for a task like this.
Edit: I've put a bounty on this to gather some further opinions and to get feedback from anybody who may have tried this before. I'll award it to the answer with the most upvotes by the time the bounty duration ends.
I can't recommend what you're contemplating, but you might be able to simplify your life by gathering the inputs to your contemplated program and making them available to clients and the support team.
If the problem were as simple as you suppose, surely the server itself or the tuning advisor would have solved it by now. You're making at least one unwarranted assumption:
require the server to be up for a minimum amount of time to ensure all relevant server statistics are up to date.
Table statistics are only as good as the last time they were updated after a significant change. Uptime won't guarantee anything about a TRUNCATE TABLE or a bulk insert.
This won't highlight duplicate indexes
But that's something you can do in a single query using the system tables. (It would be disappointing if the tuning gadget didn't help with those.) You could similarly look for overlapping indexes, such as for columns {a,b} and {a}; the second won't be useful unless {b} is selective and there are queries that don't mention {b}.
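A rough sketch of that overlap check against the system tables (this only flags indexes sharing the same leading key column; it's a starting point, not a verdict):

    SELECT OBJECT_NAME(i1.object_id) AS table_name,
           i1.name AS index_1,
           i2.name AS index_2,
           c.name  AS shared_leading_column
    FROM sys.indexes       AS i1
    JOIN sys.indexes       AS i2  ON i2.object_id = i1.object_id
                                 AND i2.index_id  > i1.index_id
    JOIN sys.index_columns AS ic1 ON ic1.object_id   = i1.object_id
                                 AND ic1.index_id    = i1.index_id
                                 AND ic1.key_ordinal = 1
    JOIN sys.index_columns AS ic2 ON ic2.object_id   = i2.object_id
                                 AND ic2.index_id    = i2.index_id
                                 AND ic2.key_ordinal = 1
                                 AND ic2.column_id   = ic1.column_id
    JOIN sys.columns       AS c   ON c.object_id = i1.object_id
                                 AND c.column_id = ic1.column_id
    WHERE i1.index_id > 0
      AND OBJECTPROPERTY(i1.object_id, 'IsUserTable') = 1
    ORDER BY table_name, index_1;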
To look for new indexes, I would be tempted to try to instrument query use frequency and automate the analysis of query plan output. If you can identify frequently used, long-running queries and map their physical operations (table scan, hash join, etc.) onto the tables and existing indexes, you would have good input for adding and removing indexes. But you have to allow for the infrequently run quarterly report that, without its otherwise unused index, would take days to complete.
I must tell you that when I did that kind of analysis once some years ago, I was disappointed to learn that most problem children were awful queries, usually prompted by awful table design. No index will help the SQL mule. Hopefully that will not be your experience.
An aspect you didn't touch on that might be just as important is machine capacity. You might look into gathering, say, hourly snapshots of SQL Server stats, like disk queue depth and paging. Hardly a server exists that can't be improved with more RAM, and sometimes that's really the best answer.
The SQL performance tuning advisor is worth a check: https://msdn.microsoft.com/en-us/library/ms186232.aspx
Another way could be to gather performance data - start here: https://www.experts-exchange.com/articles/17780/Monitoring-table-level-activity-in-a-SQL-Server-database-by-using-T-SQL.html - and generate indexes based on the performance table data.
Check this too: https://msdn.microsoft.com/en-us/library/dn817826.aspx

Would this method work to scale out SQL queries?

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum for the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
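For illustration, a minimal sketch of that average rewrite (table and column names invented):

    -- Sent to every node; each node sees only its own partition:
    SELECT region, COUNT(amount) AS cnt, SUM(amount) AS total
    FROM sales
    GROUP BY region;

    -- On the coordinator, run over the union of the per-node result sets:
    SELECT region,
           SUM(total) / SUM(cnt) AS avg_amount   -- sum of sums / sum of counts
    FROM node_results
    GROUP BY region;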
What kinds of problems could not easily be solved using this model? I believe one issue would be the count distinct function.
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your data base is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset of the data, then the simple selection query will scale to the network (i.e. you will eventually become network bound at the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
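As a rough sketch of that kind of pre-summarization (names invented; in practice you'd restrict it to the window not yet summarized):

    INSERT INTO hits_by_hour (hit_hour, url, hit_count)
    SELECT DATE_TRUNC('hour', hit_time),   -- or your RDBMS's equivalent expression
           url,
           COUNT(*)
    FROM raw_hits
    GROUP BY DATE_TRUNC('hour', hit_time), url;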
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only option, but did you first try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with the Cube is once built it can be queried in lots of ways that you may not have considered. And as #HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan then your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with #Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau: will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with pk/fk/constraints. If this were my world, I would probably create many indexed views on top of my table (and other views), optimized for the particular queries I need to execute (an approach I have worked with successfully on tables with 10 million+ rows).
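For what it's worth, a minimal sketch of one such indexed view (names invented; SQL Server requires SCHEMABINDING, two-part table names and COUNT_BIG(*) for grouped indexed views, and the summed column must not be nullable):

    CREATE VIEW dbo.v_SalesByDay
    WITH SCHEMABINDING
    AS
    SELECT sale_date,
           SUM(amount)  AS total_amount,   -- amount assumed NOT NULL
           COUNT_BIG(*) AS row_count
    FROM dbo.Sales
    GROUP BY sale_date;
    GO

    -- The unique clustered index is what materializes the view:
    CREATE UNIQUE CLUSTERED INDEX IX_v_SalesByDay
        ON dbo.v_SalesByDay (sale_date);
    GO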
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.

Practical size limitations for RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the application layer generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).
The RDBMS that I have tested against is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured that the production hardware is superior to the dev and staging hardware but I just don't believe that it will cope with the sheer volume of data and complexity of queries.
Before the dev environment failed, it was taking in excess of 5 minutes to return a small dataset (several hundred rows) that was produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.
My gut feeling is that the db architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.
Now for my question. I am assured by people who claim to have experience of this sort of thing (which I do not) that my fears are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware) or are we in the realm of technologies like Google's BigTable?
I'm hoping for answers from folks who have actually had to work with this sort of volume at a non-theoretical level.
The nature of the data is financial transactions (dates, amounts, geographical locations, businesses) so almost all data types are represented. All the reference data is normalised, hence the multiple joins.
I work with a few SQL Server 2008 databases containing tables with rows numbering in the billions. The only real problems we ran into were those of disk space, backup times, etc. Queries were (and still are) always fast, generally in the < 1 sec range, never more than 15-30 secs even with heavy joins, aggregations and so on.
Relational database systems can definitely handle this kind of load, and if one server or disk starts to strain then most high-end databases have partitioning solutions.
You haven't mentioned anything in your question about how the data is indexed, and 9 times out of 10, when I hear complaints about SQL performance, inadequate/nonexistent indexing turns out to be the problem.
The very first thing you should always be doing when you see a slow query is pull up the execution plan. If you see any full index/table scans, row lookups, etc., that indicates inadequate indexing for your query, or a query that's written so as to be unable to take advantage of covering indexes. Inefficient joins (mainly nested loops) tend to be the second most common culprit and it's often possible to fix that with a query rewrite. But without being able to see the plan, this is all just speculation.
So the basic answer to your question is yes, relational database systems are completely capable of handling this scale, but if you want something more detailed/helpful then you might want to post an example schema / test script, or at least an execution plan for us to look over.
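In SQL Server terms, a minimal way to start that inspection (DB2 has its own EXPLAIN facilities for the same purpose):

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    -- run the slow query here, then look at logical reads per table and CPU/elapsed time
    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;

    -- Or capture the estimated plan as XML without executing the query:
    SET SHOWPLAN_XML ON;
    GO
    -- the slow query goes here
    GO
    SET SHOWPLAN_XML OFF;
    GO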
90 million rows should be about 90GB, thus your bottleneck is disk.
If you need these queries rarely, run them as is.
If you need these queries often, you have to split your data and precompute your grouping, summing and averaging on the part of your data that doesn't change (or hasn't changed since last time).
For example, if you process historical data for the last N years up to and including today, you could process it one month (or week, or day) at a time and store the totals and averages somewhere. Then at query time you only need to reprocess the period that includes today.
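A minimal sketch of what querying against such precomputed periods might look like (table names and the month boundary are invented placeholders):

    -- Closed months come from the precomputed summary table...
    SELECT month_start, total_amount, row_count
    FROM monthly_summary
    WHERE month_start < DATE '2009-06-01'
    UNION ALL
    -- ...and only the current, still-changing month is aggregated at query time.
    SELECT DATE '2009-06-01', SUM(amount), COUNT(*)
    FROM transactions
    WHERE tx_date >= DATE '2009-06-01';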
Some RDBMSs give you some control over when views are updated (at select, at source change, offline). If your complicated grouping, summing and averaging is in fact simple enough for the database to understand correctly, it could, in theory, update a few rows in the view at every insert/update/delete in your source tables in reasonable time.
It looks like you're calculating the same data over and over again from normalized data. One way to speed up processing in cases like this is to keep SQL, with its nice reporting, relationships, consistency and so on, and use an OLAP cube which is recalculated every x minutes. Basically you build a big table of denormalized data on a regular basis which allows quick lookups. The relational data is treated as the master, but the cube allows quick precalculated values to be retrieved from the database at any point.
If that is only 1/20th of your data, you almost surely need to look into more scalable and efficient solutions, such as Google's BigTable. Have a look at NoSQL.
I personally think that MongoDB is an awesome in-between of NoSQL and an RDBMS. It isn't relational, but it provides a lot more features than a simple document store.
In dimensional (Kimball methodology) models in our data warehouse on SQL Server 2005, we regularly have fact tables with that many rows just in a single month partition.
Some things are instant and some things take a while, it depends on the operation and how many stars are being combined and what's going on.
The same models perform poorly on Teradata, but it is my understanding that if we re-model in 3NF, Teradata parallelization will work a lot better. The Teradata installation is many times more expensive than the SQL Server installation, so it just goes to show how much of a difference modeling and matching your data and processes to the underlying feature set matters.
Without knowing more about your data, and how it's currently modeled and what indexing choices you've made it's hard to say anything more.

real-time data warehouse for web access logs

We're thinking about putting up a data warehouse system to load with web access logs that our web servers generate. The idea is to load the data in real-time.
To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.
The question is how to balance and design the system so that:
(1) the data can be fetched and presented to the user in real time (< 2 seconds),
(2) data can be aggregated on a per-hour and per-day basis, and
(3) as large an amount of data as possible can still be stored in the warehouse.
Our current data rate is roughly ~10 accesses per second, which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds when we have more than 8 million rows.
Is it possible to get real-time query performance from a "simple" data warehouse like this,
and still have it store a lot of data (it would be nice to be able to never throw away any data)?
Are there ways to aggregate the data into higher-level (summary) tables?
I've got a feeling that this isn't really a new question (I've googled quite a lot, though). Could someone maybe give pointers to data warehouse solutions like this? One that comes to mind is Splunk.
Maybe I'm grasping for too much.
UPDATE
My schema looks like this:
dimensions:
client (ip-address)
server
url
facts:
timestamp (in seconds)
bytes transmitted
Seth's answer above is very reasonable, and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.
Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.
Another technology that you might want to investigate would be MongoDB. It is a document store database that has a few features that make it potentially a great fit for this use case.
Namely, the capped collections (do a search for mongodb capped collections for more info)
And the fast increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
Doesn't sound like it would be a problem. MySQL is very fast.
For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.
If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.
Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep it around. Even then, it probably won't be for more than 3-4 years.)
Also, look into partitioning, especially if your queries mostly access the latest data; you could -- for example -- set up weekly partitions of ~5.5M rows.
If aggregating per-day and per-hour, consider having date and time dimensions -- you did not list them, so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as the fact tables.
With this in place, the query optimizer can use partition pruning, so the total size of tables does not influence the query response as before.
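A rough MySQL sketch of the idea (all names invented): the fact table carries an integer date key and is range-partitioned on it, so a query constrained on the date key only touches the relevant partitions:

    CREATE TABLE fact_access (
        date_key  INT NOT NULL,      -- e.g. 20090316, joins to dim_date.date_key
        time_key  INT NOT NULL,      -- hour of day, or a key into a time dimension
        client_id INT NOT NULL,
        server_id INT NOT NULL,
        url_id    INT NOT NULL,
        bytes     BIGINT NOT NULL
    )
    PARTITION BY RANGE (date_key) (
        PARTITION p2009w11 VALUES LESS THAN (20090316),
        PARTITION p2009w12 VALUES LESS THAN (20090323),
        PARTITION p2009w13 VALUES LESS THAN (20090330)
        -- ... add a new partition per week
    );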
This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1 second response time (from database), over a second from web server. This isn't even on a huge server.
Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.
But you will want to mostly hit aggregate data rather than detail data - especially for time-series graphing over days, months, etc. Aggregate data can either be created periodically on your database through an asynchronous process, or, in cases like this, it typically works best if the ETL process that transforms your data also creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.
As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.
Regarding the database - if it were me I'd try Postgresql rather than MySQL. The reason is primarily optimizer maturity: postgresql can better handle the kinds of queries you're likely to run. MySQL is more likely to get confused on five-way joins, go bottom up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like db2, oracle, sql server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.