SQL Server 2008 Partitioned Table and Parallelism - sql

My company is moving to SQL Server 2008 R2. We have a table with tons of archive data. Majority of the queries that uses this table employs DateTime value in the where statement. For example:
Query 1
SELECT COUNT(*)
FROM TableA
WHERE
CreatedDate > '1/5/2010'
and CreatedDate < '6/20/2010'
I'm making the assumption that partitions are created on CreatedDate and each partition is spread out across multiple drives, we have 8 CPUs, and there are 500 million records in the database that are evenly spread out across the dates from 1/1/2008 to 2/24/2011 (38 partitions). This data could also be portioned in to quarters of a year or other time durations, but lets keep the assumptions to months.
In this case I would believe that the 8 CPU's would be utilized, and only the 6 partitions would be queried for dates between 1/5/2010 and 6/20/2010.
Now what if I ran the following query and my assumptions are the same as above.
Query 2
SELECT COUNT(*)
FROM TableA
WHERE State = 'Colorado'
Questions?
1. Will all partitions be queried? Yes
2. Will all 8 CPUs be used to execute the query? Yes
3. Will performance be better than querying a table that is not partitoned? Yes
4. Is there anything else I'm missing?
5. How would Partition Index help?
I answer the first 3 questions above, base on my limited knowledge of SQL Server 2008 Partitioned Table & Parallelism. But if my answers are incorrect, can you provide feedback any why I'm incorrect.
Resource:
Video: Demo SQL Server 2008 Partitioned Table Parallelism (5 minutes long)
MSDN: Partitioned Tables and Indexes
MSDN: Designing Partitions to Manage Subsets of Data
MSDN: Query Processing Enhancements on Partitioned Tables and Indexes
MSDN: Word Doc: Partitioned Table and Index Strategies Using SQL Server 2008 white paper
BarDev

Partitioning is never an option for improving performance. The best you can hope for is to have on-par performance with non-partitioned table. Usually you get a regression that increases with the number of partitions. For performance you need indexes, not partitions. Partitions are for data management operations: ETL, archival etc. Some claim that partition elimination is possible performance gain, but for anything partition elimination can give placing the leading index key on the same column as the partitioning column will give much better results.
Will all partitions be queried?
That query needs an index on State. Otherwise is a table scan, and will scan the entire table. A table scan over a partitioned table is always slower than a scan over the same size non-partitioned table. The index itself can be aligned on the same partition scheme, but the leading key must be State.
Will all 8 CPUs be used to execute the query?
Parallelism has nothing to do with partitioning, despite the common misconception of the contrary. Both partitioned and non-partitioned range scans can be use a parallel operator, it will be the Query Optimizer decision.
Will performance be better than querying a table that is not
partitioned?
No
How would Partition Index help?
An index will help. If the index has to be aligned, then it must be partitioned. A non-partitioned index will be faster than a partitioned one, but the index alignment requirement for switch-in/switch-out operations cannot be circumvented.
If you're looking at partitioning, it should be because you need to do fast switch-in switch-out operations to delete old data past retention policy period or something similar. For performance, you need to look at indexes, not at partitioning.

Partitioning can increase performance--I have seen it many times. The reason partitioning was developed was and is performance, especially for inserts. Here is an example from the real world:
I have multiple tables on a SAN with one big ole honking disk as far as we can tell. The SAN administrators insist that the SAN knows all so will not optimize the distribution of data. How can a partition possibly help? Fact: it did and does.
We partitioned multiple tables using the same scheme (FileID%200) with 200 partitions ALL on primary. What use would that be if the only reason to have a partitioning scheme is for "swapping"? None, but the purpose of partitioning is performance. You see, each of those partitions has its own paging scheme. I can write data to all of them at once and there is no possibility of a deadlock. The pages cannot be locked because each writing process has an unique ID that equates to a partition. 200 partitions increased performance 2000x (fact) and deadlocks dropped from 7500 per hour to 3-4 per day. This for the simple reason that page lock escalation always occurs with large amounts of data and a high volume OLTP system and page locks are what cause deadlocks. Partitioning, even on the same volume and file group, places the partitioned data on different pages and lock escalation has no effect since processes are not attempting to access the same pages.
THe benefit is there, but not as great, for selecting data. But typically the partitioning scheme would be developed with the purpose of the DB in mind. I am betting Remus developed his scheme with incremental loading (such as daily loads) rather than transactional processing in mind. Now if one were frequently selecting rows with locking (read committed) then deadlocks could result if processes attempted to access the same page simultaneously.
But Remus is right--in your example I see no benefit, in fact there may be some overhead cost in finding the rows across different partitions.

the very first question i have is if your table has a clustered index on it. if not, you'll want one.
Also, you'll want a covering index for your queries. Covering Indexes
If you have a lot of historical data you might look into an archiving process to help speed up your oltp applications.

Related

Querying large table with filter vs small table in database - any performance gain?

I have a large table with 10 million records and is used for one of our existing applications. we are working on a new application wherein it requires only filtered result set of large table with 7000 records.
My question is will there be any performance gain going for a smaller table with 7000 records vs querying large table with filter condition(and it will joined to few other tables in the schema which are completely independant from existing application)? or should I avoid redundancy maintaining all the data in one table? This is the design in data warehouse. Please suggest!
Thank you!
For almost any database, using a sample table will be noticeably faster. This is because reading the records will require loading fewer data pages.
In addition, if the base table is being updated, then a "snapshot" is isolated from page, table, and row locks that occur on the main table. This is good from a performance perspective, but it means that the versions can get out-of-synch, which may be bad.
And, from a querying perspective, the statistics on the sample would be more accurate. This helps the optimizer choose the best query plans.
I can think of two cases where performance might not improve significantly. The first is if your database supports clustered indexes and the rows that you want are defined by a range of index keys (or a single key). These will be "adjacent", so the clustered index would scan about the same number of pages. There is a slight overhead for the actual index structure.
Similarly, if your records were so large that there was one record per data page, then the advantage of a second table would be less. It would eliminate the index access overhead, but not reduce the number of reads.
None of these considerations say whether or not you should use a separate table. You should test in your environment. The overhead of managing a separate table (and there is a cost to creating and deleting it both in terms of performance and application complexity) may outweigh small performance gains.

SAP HANA PARTITIONED TABLE CALCULATION VIEW RUNNING SLOW IN COMPARISON TO NON-PARTITIONED TABLE CALCULATION VIEWE

I have large size table , close to 1 GB and the size of this table is growing every week, it has total rows as 190 millions, I started getting alerts from HANA to partition this table, so I planned to partition this with column which is frequently used in Where clause.
My HANA System is scale out system with 8 nodes.
In order to compare the partition query performance difference with this un-partitioned table, I created calculation views on top of this un-partitioned table and recorded the query performance.
I partitioned this table using HASH method and by number of servers, and recorded the query performance. By this way I would have good data distribution across servers. I created calculation view and recorded query performance.
To my surprise I have found that my un-partitioned table calculation view query is performing better in comparison to partitioned table calculation view.
This was really shock. Not sure why non partitioned table Calculation view responding better to partitioned table Calculation view.
I have plan viz output files but not sure where to attach it.
Let me know why this is the behaviour?
Ok, this is not a straight-forward question that can be answered correctly as such.
What I can do though is to list some factors that likely will play a role here:
a non-partitioned table needs a single access to the table structure while the partitioned version requires at least one access for each partition
if the SELECT is not actually providing a WHERE condition that can be evaluated by the HASH function used for the partitioning, then all partitions always have to be evaluated and no partition pruning can take place.
HASH partitioning does not take any additional knowledge about the data into account, which means that similar data does not get stored together. This has a negative impact on data compression. Also, each partition requires its own set of value dictionaries for the columns where a single-partition/non-partitioned table only needs one dictionary per column.
You mentioned that you are using a scale-out system. If the table partitions are distributed across the different nodes, then every query will result in cross-node network communication. That is an additional workload and waiting time that simply does not exist with non-partitioned tables.
When joining partitioned tables each partition of the first table has to be joined with each partition of the second table, if no partition-wise join is possible.
There are other/more potential reasons for why a query against partitioned tables can be slower than against a non-partitioned table. All this is extensively explained in the SAP HANA Administration Guide.
As a general guidance, tables should only be partitioned if that cannot be avoided and when the access pattern of queries are well understood. It is definitively not a feature that you just "switch on" and everything will just work fine.

Relation with DB size and performance

Is there any relation between DB size and performance in my case:
There is a table in my Oracle DB that is used for logging. Now it has almost close to over 120 million rows and increases at a rate of 1000 rows per min. Each row has 6-7 columns with basic string data.
It is for our client. We never take any data from there but we might need that in case of any issues. However its fine if we clean up every month or so.
However the actual issue is will it affect performance of other transactional tables in the same db? Assuming the disk space as unlimited.
If 1000 rows/minute are being inserted into this table then about 40 million rows would be added per month. If this table has indexes I'd say that the biggest issue will be that eventually index maintenance will become a burden on the system, so in that case I'd expect performance to be affected.
This table seems like a good candidate for partitioning. If it's partitioned on the date/time that each row is added, with each partition containing one month's worth of data, maintenance would be much simpler. The partitioning scheme can be set up so that partitions are created automatically as needed (assuming you're on Oracle 11 or higher), and then when you need to drop a month's worth of data you can just drop the partition containing that data, which is a quick operation which doesn't burden the system with a large number of DELETE operations.
Best of luck.

What is a good size (# of rows) to partition a table to really benefit?

I.E. if we have got a table with 4 million rows.
Which has got a STATUS field that can assume the following value: TO_WORK, BLOCKED or WORKED_CORRECTLY.
Would you partition on a field which will change just one time (most of times from to_work to worked_correctly)? How many partitions would you create?
The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.
For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.
Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.
The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.
So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows - if not all - rows will end up in the WORKED_CORRECTLY. So that partition will become enormous compared to the others, and the chances of benefits from partitioning become even more remote.
Finally, your proposed scheme is not elastic. As the current volume each partition would have 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.
So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change and one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.
This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.
The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.
What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?
Assuming you are trying to improve query performance? Do all your queries have predicates on the STATUS column? Are they doing single row lookups of rows? Or would you want your queries to scan an entire partition?

To what degree can effective indexing overcome performance issues with VERY large tables?

So, it seems to me like a query on a table with 10k records and a query on a table with 10mil records are almost equally fast if they are both fetching roughly the same number of records and making good use of simple indexes(auto increment, record id type indexed field).
My question is, will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records in to a very large indexed table can be very slow because all the indexes have to be recalculated, if I add new records only to the end of the table can I avoid that slow down, or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for a FAQs/caveats about working with very large tables, but couldn't really find one, so if anyone knows of something like that, that link would be appreciated.
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database related) depends on many factors, incuding your access patterns, ratio of Reads to Writes and size of available RAM.
If you can fit your 'hot' (i.e. frequently accessed index pages) into memory then accesses will generally be fast.
The strategy used to index very large tables, is using partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key then there will no improvement in performance over an unpartitioned table i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not binary tree which only has two children per node), so a 10k row table might have 2 levels whereas a 10B row table might have 4 levels (depending on the width of the rows). So as the table gets ridiculously large it might go to 5 levels or higher, but this only means one extra page read so is probably not noticeable.
When you insert new rows, you cant control where they are inserted in the physical layout of the table so I assume you mean "end of the table" in terms of using the maximum value being indexed. I know Oracle has some optimisations around leaf block splitting in this case, but I dont know about Postgres.
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have vast numbers of options which can allow you to index part of a table or the output of an immutable function on tuples in the table. Also size of the index, assuming it is usable, will affect speed much more slowly than will the actual scan of the table. The biggest difference is between searching a tree and scanning a list. Of course you still have disk I/O and memory overhead that goes into index usage, and so large indexes don't perform as well as they theoretically could.