Does SQL Server support hash partitioning?

My current database is SQL Server 2008 and I will be upgrading to SQL Server 2014. I cannot confirm whether SQL Server 2014 supports hash partitioning. I have a single table with about 29M records, and it is growing extremely fast: over the past year it has doubled every 3-4 months. I'd like to horizontally partition the table based on a client id. I've searched online and cannot confirm that hash partitioning is supported.

No, SQL Server does not support hash partitioning. As Ben says, you can roll your own using a hashing function and a persisted computed column. The only scenario where this is recommended is when latch contention on the last page is slowing down inserts, and at no other time. Read Hash Partitioning, SQL Server, and Scaling Writes for more details.
This table is growing extremely fast. Over the past year it has doubled every 3-4 months.
So, what does this have to do with hash partitioning, or with any partitioning for that matter? Partitioning gives no performance benefit; its purpose is data storage management. For performant access to large datasets, consider indexes. For analytic workloads, use columnstore indexes. For general performance issues, read How to analyse SQL Server performance.
Kendra Little has a decent article How To Decide if You Should Use Table Partitioning.

Not directly, but you can "fake it". Specifically, if you come up with your own hashing function (say ClientID modulo «desired number of partitions»), you can use that as your partitioning key (or part of it).
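To make that concrete, here is a minimal sketch of the "fake it" approach, assuming 8 buckets computed from ClientID (the function, scheme, table, and column names are all illustrative):

-- Illustrative only: 8 hash buckets derived from ClientID.
CREATE PARTITION FUNCTION pfHash (tinyint)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);   -- 7 boundaries => 8 partitions

CREATE PARTITION SCHEME psHash
    AS PARTITION pfHash ALL TO ([PRIMARY]);

CREATE TABLE dbo.ClientData
(
    ClientID   int          NOT NULL,
    Payload    varchar(100) NULL,
    -- persisted computed column used as the "hash" partitioning key
    HashBucket AS CAST(ClientID % 8 AS tinyint) PERSISTED NOT NULL
) ON psHash (HashBucket);

Note that the computed column must be PERSISTED to serve as the partitioning key, and partition elimination only happens for queries that filter on the bucket column.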

Related

Process more than 500 million rows to cube without performance issues

I have a huge database. For example:
My customer loads 500 million rows of sales data every day into a buffer fact table called "Sales". I have to process these sales into my cube in append/update mode, but this is destroying performance even with 186 GB of RAM.
I've already tried to create indexes on the dimension tables, this help a little but not too much.
My customer said that they expect a 15% sales data increment every 6 months...
Is there a smart way to load this data without waiting too many hours?
I'm using SQL Server 2016.
Thanks!
You can use the columnstore index feature of SQL Server 2016.
Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage. You can also achieve gains up to 10 times the data compression over the uncompressed data size. Beginning with SQL Server 2016 (13.x), columnstore indexes enable operational analytics: the ability to run performant real-time analytics on a transactional workload.
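For the append-heavy fact table in the question, a minimal sketch (the table name comes from the question, the index name is illustrative):

-- A clustered columnstore index replaces the table's rowstore storage entirely.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sales
    ON dbo.Sales;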
You can get more details from the Microsoft documentation link.
If you're using a SAN to store your database, you might want to look into software like Condusiv V-locity to reduce the amount of I/O sent to and received from the database engine.
I might suggest setting up a separate database server: ship the transaction log over to it and apply the logs every 15 minutes so you can run analytics without using the live data. That way, the heavy writes to the production DB will not interfere with complex queries that lock tables or rows from time to time on your reporting server.
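A sketch of the periodic restore step on the reporting server (the database name and file paths are illustrative):

-- STANDBY keeps the reporting copy readable between restores.
RESTORE LOG SalesReporting
    FROM DISK = N'\\backupshare\logs\Sales_log_201701011015.trn'
    WITH STANDBY = N'D:\Standby\SalesReporting_undo.dat';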

Generate Query Hash and Query Plan Hashes for lots of Queries in MS SQL

I am working on optimizing the load on a decently large SQL Server 2008 cluster, and I have a sample of the queries submitted to the server over a short timespan. This amounts to about 1.7 million queries, and I am trying to determine how many are one-off ad-hoc queries and how many are substantially the same and submitted often by applications, so that the highest-usage and most resource-intensive queries are optimized first.
To do this, I wish to use the query hash and query plan hash, and add them to my analysis table. The DMVs in SQL Server only keep these values for a few minutes (maybe a bit longer depending on memory pressure), so I can't query the DMVs to pull the hashes. I know that the hashes can be generated one at a time using the SET SHOWPLAN_XML option, but that isn't exactly friendly: showplan must be turned on, the result returned and parsed, then showplan turned off before saving to a table.
I am hoping there is an undocumented function that can generate the 2 hashes and return a value I can store to a table; so far I haven't found one. Does anyone know of such a function?
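For reference, a sketch of the parsing half of the workflow described above: pulling the two hash attributes out of a showplan XML document captured from the client under SET SHOWPLAN_XML ON (the variable is a placeholder for the captured plan):

DECLARE @plan xml = N'<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan"/>';  -- placeholder

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    stmt.value('@QueryHash',     'varchar(20)') AS QueryHash,
    stmt.value('@QueryPlanHash', 'varchar(20)') AS QueryPlanHash
FROM @plan.nodes('//StmtSimple') AS x(stmt);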

SQL Server 2008 Partitioned Table and Parallelism

My company is moving to SQL Server 2008 R2. We have a table with tons of archive data. The majority of the queries that use this table have a DateTime value in the WHERE clause. For example:
Query 1
SELECT COUNT(*)
FROM TableA
WHERE
CreatedDate > '1/5/2010'
and CreatedDate < '6/20/2010'
I'm making the assumption that partitions are created on CreatedDate, each partition is spread out across multiple drives, we have 8 CPUs, and there are 500 million records in the database evenly spread across the dates from 1/1/2008 to 2/24/2011 (38 partitions). This data could also be partitioned into quarters of a year or other time spans, but let's keep the assumption to months.
In this case I believe all 8 CPUs would be utilized, and only the 6 partitions covering dates between 1/5/2010 and 6/20/2010 would be queried.
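For reference, a minimal sketch of the monthly setup assumed above (boundary list abbreviated; the names, columns, and single-filegroup mapping are illustrative):

CREATE PARTITION FUNCTION pfCreatedDate (datetime)
    AS RANGE RIGHT FOR VALUES ('20080201', '20080301', /* ...one boundary per month... */ '20110201');

CREATE PARTITION SCHEME psCreatedDate
    AS PARTITION pfCreatedDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.TableA
(
    Id          bigint      NOT NULL,
    CreatedDate datetime    NOT NULL,
    State       varchar(30) NULL
) ON psCreatedDate (CreatedDate);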
Now, what if I ran the following query, with the same assumptions as above?
Query 2
SELECT COUNT(*)
FROM TableA
WHERE State = 'Colorado'
Questions:
1. Will all partitions be queried? Yes
2. Will all 8 CPUs be used to execute the query? Yes
3. Will performance be better than querying a table that is not partitioned? Yes
4. Is there anything else I'm missing?
5. How would a partitioned index help?
I answered the first 3 questions above based on my limited knowledge of SQL Server 2008 partitioned tables and parallelism. If my answers are incorrect, can you provide feedback on why?
Resource:
Video: Demo SQL Server 2008 Partitioned Table Parallelism (5 minutes long)
MSDN: Partitioned Tables and Indexes
MSDN: Designing Partitions to Manage Subsets of Data
MSDN: Query Processing Enhancements on Partitioned Tables and Indexes
MSDN: Word Doc: Partitioned Table and Index Strategies Using SQL Server 2008 white paper
BarDev
Partitioning is never an option for improving performance. The best you can hope for is performance on par with a non-partitioned table; usually you get a regression that increases with the number of partitions. For performance you need indexes, not partitions. Partitions are for data management operations: ETL, archival, etc. Some claim partition elimination as a possible performance gain, but anything partition elimination can give, placing the leading index key on the partitioning column will give with much better results.
Will all partitions be queried?
That query needs an index on State. Otherwise it is a table scan and will scan the entire table. A table scan over a partitioned table is always slower than a scan over a non-partitioned table of the same size. The index itself can be aligned on the same partition scheme, but its leading key must be State.
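A sketch of such an index, assuming the partition scheme and table from the question's setup (names are illustrative):

-- Aligned index with State as the leading key; CreatedDate keeps it on the
-- partition scheme so partition switching still works.
CREATE NONCLUSTERED INDEX IX_TableA_State
    ON dbo.TableA (State, CreatedDate)
    ON psCreatedDate (CreatedDate);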
Will all 8 CPUs be used to execute the query?
Parallelism has nothing to do with partitioning, despite the common misconception to the contrary. Both partitioned and non-partitioned range scans can use a parallel operator; it is the Query Optimizer's decision.
Will performance be better than querying a table that is not partitioned?
No
How would a partitioned index help?
An index will help. If the index has to be aligned, then it must be partitioned. A non-partitioned index will be faster than a partitioned one, but the index alignment requirement for switch-in/switch-out operations cannot be circumvented.
If you're looking at partitioning, it should be because you need to do fast switch-in switch-out operations to delete old data past retention policy period or something similar. For performance, you need to look at indexes, not at partitioning.
Partitioning can increase performance--I have seen it many times. Performance was, and is, the reason partitioning was developed, especially for inserts. Here is an example from the real world:
I have multiple tables on a SAN with one big ole honking disk as far as we can tell. The SAN administrators insist that the SAN knows all so will not optimize the distribution of data. How can a partition possibly help? Fact: it did and does.
We partitioned multiple tables using the same scheme (FileID % 200) with 200 partitions, ALL on PRIMARY. What use would that be if the only reason to have a partitioning scheme is for "swapping"? None, but the purpose of partitioning is performance. Each of those partitions has its own paging scheme, so I can write data to all of them at once with no possibility of a deadlock. The pages cannot be locked against each other because each writing process has a unique ID that equates to a partition. 200 partitions increased performance 2000x (fact) and deadlocks dropped from 7,500 per hour to 3-4 per day, for the simple reason that page-lock escalation always occurs with large amounts of data in a high-volume OLTP system, and page locks are what cause deadlocks. Partitioning, even on the same volume and filegroup, places the partitioned data on different pages, and lock escalation has no effect since processes are not attempting to access the same pages.
The benefit is there for selecting data too, but not as great. Typically the partitioning scheme would be developed with the purpose of the DB in mind. I am betting Remus developed his scheme with incremental loading (such as daily loads) rather than transactional processing in mind. Now, if one were frequently selecting rows with locking (read committed), then deadlocks could result if processes attempted to access the same page simultaneously.
But Remus is right--in your example I see no benefit, in fact there may be some overhead cost in finding the rows across different partitions.
The very first question I have is whether your table has a clustered index on it. If not, you'll want one.
Also, you'll want a covering index for your queries; see Covering Indexes.
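For the two COUNT(*) queries in the question, covering indexes could be as narrow as this (index names are hypothetical):

CREATE NONCLUSTERED INDEX IX_TableA_Covering_CreatedDate ON dbo.TableA (CreatedDate);  -- covers Query 1
CREATE NONCLUSTERED INDEX IX_TableA_Covering_State       ON dbo.TableA (State);        -- covers Query 2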
If you have a lot of historical data, you might look into an archiving process to help speed up your OLTP applications.

Upper Limit for Number of Rows In Open Source Databases?

I have a project in which I'm doing data mining on a large database. I currently store all of the data in text files, and I'm trying to understand the costs and benefits of storing the data in a relational database instead. The points look like this:
CREATE TABLE data (
    source1 CHAR(5),
    source2 CHAR(5),
    idx11   INT,
    idx12   INT,
    idx21   INT,
    idx22   INT,
    point1  FLOAT,
    point2  FLOAT
);
How many points like this can I have with reasonable performance? I currently have ~150 million data points, and I probably won't have more than 300 million. Assume that I am using a box with four dual-core 2 GHz Xeon CPUs and 8 GB of RAM.
PostgreSQL should be able to amply accommodate your data -- up to 32 Terabytes per table, etc, etc. If I understand correctly, you're talking about 5 GB currently, 10 GB max (about 36 bytes/row and up to 300 million rows), so almost any database should in fact be able to accommodate you easily.
FYI: Postgres scales better than MySQL on multi-processor / overlapping requests, from a review I was reading a few months back (sorry, no link).
I assume from your profile that this is some sort of bioinformatics (codon sequences, enzyme vs. protein amino acid sequences, or some such) problem. If you are going to attack this with concurrent requests, I'd go with Postgres.
OTOH, if the data is going to be loaded once, then scanned by a single thread, maybe MySQL in its "ACID not required" mode would be the best match.
You've got some planning to do around your access use case(s) before you can select the "best" stack.
MySQL is more than capable of serving your needs as well as Alex's suggestion of PostgreSQL. Reasonable performance shouldn't be difficult to achieve, but if the table is going to be heavily accessed and have a large amount of DML, you will want to know more about the locking used by the database you end up choosing.
I believe PostgreSQL can use row-level locking out of the box, whereas MySQL depends on the storage engine you choose. MyISAM only locks at the table level, so concurrency suffers, but storage engines such as InnoDB can and will use row-level locking to increase throughput. My suggestion would be to start with MyISAM and move to InnoDB only if you find you need row-level locking. MyISAM works well in most situations and is extremely lightweight. I've had tables with over 1 billion rows in MySQL using MyISAM, and with good indexing and partitioning you can get great performance. You can read more about MySQL storage engines at MySQL Storage Engines and about table partitioning at Table Partitioning. Here is an article on partitioning in practice on a table of 113M rows that you may find useful as well.
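As an illustrative sketch only (MySQL syntax; the choice of idx11 as the partitioning column and 8 partitions are assumptions), the question's table could be created like this:

CREATE TABLE data (
    source1 CHAR(5),
    source2 CHAR(5),
    idx11   INT,
    idx12   INT,
    idx21   INT,
    idx22   INT,
    point1  FLOAT,
    point2  FLOAT
)
ENGINE = MyISAM            -- swap in InnoDB if row-level locking turns out to be needed
PARTITION BY HASH (idx11)
PARTITIONS 8;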
I think the benefits of storing the data in a relational database far outweigh the costs. There are so many things you can do once your data is within a database: point-in-time recovery, ensuring data integrity, finer-grained security access, partitioning of data, availability to other applications through a common language (SQL), and so on.
Good luck with your project.

What are the benefits of using partitions with the Enterprise edition of SQL 2005

I'm comparing two techniques for creating partitioned tables in SQL 2005.
Use partitioned views with a standard version of SQL 2005 (described here)
Use the built in partition in the Enterprise edition of SQL 2005 (described here)
Given that the Enterprise edition is much more expensive, I would like to know what the main benefits of the newer Enterprise built-in implementation are. Is it just a time saver for the implementation itself, or will I gain real performance on large DBs?
I know I can adjust the constraints in the first option to keep a sliding window into the partitions. Can I do that with the built-in version?
searchdotnet rulz! check this out:
http://www.eggheadcafe.com/forumarchives/SQLServerdatawarehouse/Dec2005/post25052042.asp
Update: that link is dead, so here's a better one:
http://msdn.microsoft.com/en-us/library/ms345146(SQL.90).aspx#sql2k5parti_topic6
From above:
Some of the performance and manageability benefits (of partitioned tables) are:
- Simplify the design and implementation of large tables that need to be partitioned for performance or manageability purposes.
- Load data into a new partition of an existing partitioned table with minimal disruption in data access in the remaining partitions.
- Load data into a new partition of an existing partitioned table with performance equal to loading the same data into a new, empty table.
- Archive and/or remove a portion of a partitioned table while minimally impacting access to the remainder of the table.
- Allow partitions to be maintained by switching partitions in and out of the partitioned table.
- Allow better scaling and parallelism for extremely large operations over multiple related tables.
- Improve performance over all partitions.
- Improve query optimization time because each partition does not need to be optimized separately.
When using partitioned tables you can more easily move data from partition to partition, and you can partition the indexes as well.
You can also move data from one partition to another table as needed with a single ALTER TABLE command.
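For example, a sketch of that switch (table names are illustrative; the target must be an empty table with an identical schema on the same filegroup):

-- Move everything in partition 1 into the archive table without physically copying rows.
ALTER TABLE dbo.FactSales
    SWITCH PARTITION 1 TO dbo.FactSales_Archive;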