SQL 2005 DB Partitioning for SharePoint

Background
I have a massive DB for a SharePoint site collection. It is 130GB and growing at 10GB per month. 100GB of the 130GB is in one site collection. 30GB is the version table. There is only one site collection - this is by design.
Question
Am I able to partition a database (SharePoint) using SQL 2005's data partitioning features (creating multiple data files)?
Is it possible to partition a database that is already created?
Has anyone partitioned a SharePoint DB? Will I encounter any issues?

You would have to create a partition function and scheme, then rebuild the table on that partition scheme. SQL 2005 can only partition on a single column, so you would have to have a column in the DB that:
Behaves fairly predictably so you don't get a large skew in the amount of data in each partition
IIRC the column has to be a numeric or datetime value
In practice it's easiest if it's monotonically increasing - you can create a series of partitions (automatically or manually) and the system will fill them up as it gets to the range definitions.
A date (perhaps the date the document was entered) would be ideal. However, you may or may not have a useful column on the large table. Microsoft tech support would be the best source of advice for this.
The partitioning should be transparent to the application (again, you need a column with appropriate behaviour to use as a partition key).
Unless you are lucky enough to have a partition key column that is also used as a search predicate in the most common queries, you may not get much query performance benefit from the partitioning. An example of a column that works well is a date column on a data warehouse. However, your SharePoint application may not make extensive use of this sort of query.
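For illustration only, here is a minimal T-SQL sketch of what "create a partition set and rebuild the table on it" looks like in SQL 2005, assuming a hypothetical dbo.Docs table with a TimeCreated column as the partition key (these names are illustrative, not SharePoint's actual schema):

    -- Partition function: one boundary per year (RANGE RIGHT puts each
    -- boundary value into the partition to its right)
    CREATE PARTITION FUNCTION pfDocsByYear (datetime)
    AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01', '2009-01-01');

    -- Partition scheme: map every partition to a filegroup
    -- (all on PRIMARY here; in practice you would spread them over filegroups/files)
    CREATE PARTITION SCHEME psDocsByYear
    AS PARTITION pfDocsByYear ALL TO ([PRIMARY]);

    -- Rebuild the clustered index on the partition scheme; this physically
    -- moves the rows into the partitions. If the table already has a clustered
    -- index, use its existing name together with WITH (DROP_EXISTING = ON).
    CREATE CLUSTERED INDEX CIX_Docs_TimeCreated
    ON dbo.Docs (TimeCreated)
    ON psDocsByYear (TimeCreated);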

Mauro,
Is there no way you can segment the data at a SharePoint level?
i.e. you may have multiple "sites" using a single (SQL) content database.
You could migrate site data to a new content database, which would allow you to reduce the data in that large content database and then shrink the data files.
It will also assist you in managing your obvious continued growth.
James.
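For the shrink step, a minimal sketch (the database and logical file names below are hypothetical; check sys.database_files for the real ones):

    USE WSS_Content_Large;   -- hypothetical content database name
    GO
    -- List the logical file names and current sizes (size is in 8 KB pages)
    SELECT name, size / 128 AS size_mb FROM sys.database_files;
    GO
    -- Shrink the data file down to a target size in MB
    DBCC SHRINKFILE (N'WSS_Content_Large', 50000);
    GO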

Related

Keeping archived data schema up to date with running data warehouse

Recently our 5-year-old MySQL data warehouse (used mostly for business reporting) has gotten quite full and we need to come up with a way to archive old data which is not frequently accessed to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
However, we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
Is there a 'canonical' way to maintain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.
Should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?
There are tons of materials on the internet if you search for the term "schema evolution" in the big data space.
The Athena documentation has a chapter on schema updates, with case-by-case examples, here.
If you are re-processing the whole archived dataset to handle a schema change, you are probably doing a bit too much.
Since you have Parquet files, and by default Athena resolves Parquet columns by name rather than by index, you are safe in almost all cases (adding new columns, dropping columns, etc.) except column renames. To handle renamed columns (and also additions/drops), the fastest way is to use a view: in the view definition you can alias the renamed column. Also, if column renames make up most of your schema evolution and you do them a lot, you can consider AVRO to handle that gracefully.
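A minimal sketch of the view approach (the table and column names are made up: say the live warehouse renamed customer_id to client_id, while the archived Parquet files still carry customer_id):

    -- Athena / Presto SQL
    CREATE OR REPLACE VIEW archive.orders_v AS
    SELECT
        order_id,
        order_date,
        customer_id AS client_id,   -- alias the old archived name to the new one
        total_amount
    FROM archive.orders_raw;

Report queries then hit the view, so a rename only requires updating the view definition, not rewriting the Parquet files.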
Plan A:
It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
Every period, "Export tablespace" to remove the oldest from the partition scheme.
That tablespace will then be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).
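A rough sketch of Plan A in MySQL (5.6+), using a made-up fact_sales table; EXCHANGE PARTITION is one way to turn the oldest partition into a standalone table that can then be dumped to the archive and dropped:

    -- One-time setup: partition the big table by year
    -- (needs room for a full copy of the table, and the partition key
    --  must be part of every unique key on the table)
    ALTER TABLE fact_sales
      PARTITION BY RANGE (YEAR(sale_date)) (
        PARTITION p2019 VALUES LESS THAN (2020),
        PARTITION p2020 VALUES LESS THAN (2021),
        PARTITION p2021 VALUES LESS THAN (2022),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
      );

    -- Every period: detach the oldest partition as its own table
    CREATE TABLE fact_sales_2019 LIKE fact_sales;
    ALTER TABLE fact_sales_2019 REMOVE PARTITIONING;
    ALTER TABLE fact_sales EXCHANGE PARTITION p2019 WITH TABLE fact_sales_2019;
    -- ...dump/copy fact_sales_2019 to the archive, then drop it...
    ALTER TABLE fact_sales DROP PARTITION p2019;

    -- Separately, carve a new partition out of pmax for the next period
    ALTER TABLE fact_sales REORGANIZE PARTITION pmax INTO (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
      );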
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
Don't archive the old data.
Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
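A minimal sketch of Plan B, with made-up table and column names: roll the rows that are about to be removed up into a small summary table before deleting them.

    CREATE TABLE sales_daily_summary (
        sale_day   DATE NOT NULL,
        store_id   INT  NOT NULL,
        order_cnt  INT  NOT NULL,
        total_amt  DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (sale_day, store_id)
    );

    -- Summarize the range that is about to be dropped from the big table
    INSERT INTO sales_daily_summary (sale_day, store_id, order_cnt, total_amt)
    SELECT DATE(sale_ts), store_id, COUNT(*), SUM(amount)
    FROM fact_sales
    WHERE sale_ts >= '2019-01-01' AND sale_ts < '2020-01-01'
    GROUP BY DATE(sale_ts), store_id;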

SQL Server Time Series Modelling Huge datacollection

I have to implement data collection for replay of electrical parameters for 100s-1000s of devices, with at least 20 parameters to monitor per device. This amounts to a huge data collection, as it is essentially a time series. I have to support a resolution of 1 second. Thinking about 1 year for 1000 devices: [365*24*60*60*1000] = 31,536,000,000 rows.
I did my research but still have a few questions:
1) As the data will be huge, is it good to keep the data in the same table, or should the tables be split [the data structure is the same], or should I rely on indexes?
2) Data inserts will also be very frequent, but I can batch them. Still, what is the best way? Writing directly to the same database, or using a temporary database for writes and syncing from it?
3) Does SQL Server have a specific schema recommendation for time-series optimization of selects, updates and inserts? Any out-of-the-box help for day averages, or specific common aggregate functions? I can write my own, but since this is a standard problem they might have best practices and samples out of the box.
Please let me know; any help is appreciated. Thanks in advance.
1) You probably want to explore the use of partitions. This will allow very efficient inserts (it's a metadata operation if you do the partitioning correctly) and very fast (2). You may want to explore columnstore indexes because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore require a learning curve, but it's very doable. There is lots of code on the internet describing the use of date functions in SQL Server.
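A minimal sketch of the columnstore idea (table and column names are invented, and a clustered columnstore index assumes SQL Server 2014 or later):

    CREATE TABLE dbo.DeviceReading (
        DeviceId    int          NOT NULL,
        ReadingTime datetime2(0) NOT NULL,
        Param01     float        NULL,
        Param02     float        NULL
        -- ...remaining parameter columns...
    );

    -- Columnstore compresses well and is built for large scans/aggregations
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_DeviceReading ON dbo.DeviceReading;

    -- "Out of the box" day average: just GROUP BY the date part
    SELECT DeviceId,
           CAST(ReadingTime AS date) AS ReadingDay,
           AVG(Param01)              AS AvgParam01
    FROM dbo.DeviceReading
    GROUP BY DeviceId, CAST(ReadingTime AS date);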
That is a big number, but I would start with one table and see if it holds up. If you split it into multiple tables it is still the same amount of data.
Do you ever need to search across devices? If not, you can have a separate table for each device.
I have some audit tables that are not that big but still big and have not had any problems. If the data is loaded in time order then make date the first (or only) column of the clustered index.
If the PK is (date, device) then fine, but if you can get two readings in the same second you cannot do that. If this is the PK, then load the data in that sort order if you can, even if you have to stage each second and load it. You just cannot afford to fragment a table that big. If you cannot load in the sort order, then use a fill factor of 50%.
If you cannot have a PK, then just use date as the clustered index (but not as a PK) and put a nonclustered index on device.
I have some tables of 3,000,000,000 rows and I have the luxury of loading by PK with no other indexes. There is no measurable degradation in insert speed from row 1 to row 3,000,000,000.
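A minimal sketch of that indexing advice, independent of the earlier sketch and again with invented names:

    CREATE TABLE dbo.Reading (
        DeviceId    int          NOT NULL,
        ReadingTime datetime2(0) NOT NULL,
        Value       float        NULL
    );

    -- Non-unique clustered index on the time column: inserts in time order
    -- append at the end of the table, so there is no fragmentation.
    -- If you cannot load in time order, consider adding WITH (FILLFACTOR = 50).
    CREATE CLUSTERED INDEX CIX_Reading_Time ON dbo.Reading (ReadingTime);

    -- Separate nonclustered index for per-device lookups
    CREATE NONCLUSTERED INDEX IX_Reading_Device ON dbo.Reading (DeviceId);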

Can I create a new Partition in a Teradata table each time a new data load came in

I have approx 95 data integration interfaces that are extracting and loading data from an SAP ECC environment. Each dataset "Lands" in Teradata in a single table and is subsequently processed into a more structured and keyed table for downstream consumption. This takes care of dupes and delta changes, etc.
My problem is the initial Landing table which gets truncated each time a new load comes in. If a Landed dataset fails to process forward it will be truncated and a new dataset will replace that in Landing as the next load comes in. I did not design or develop any of this, but it causes a lot of grief as well as remedy (bug) tickets.
I want to propose a solution and wanted to know if it is possible to not truncate the initial Landing table, but instead create a new partition for each Landed dataset as a new load comes in. In that fashion we would avoid losing the dataset altogether. Is this possible in Teradata?
I wanted to add that we are using Informatica to connect SAP BW to Teradata as that has some bearing on the degree of difficulty we are facing. Not a huge Fan of Informatica as it is used in this case.
Thanks in advance for everyone's help,
Pat
This problem is not to be solved by partitioning the landing table. Note that partitions are created to enhance query performance by organizing "similar" rows together on the AMPs.
A partitioned primary index (PPI) permits rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for range queries that are predicated on primary index values. For details, see Database Design and SQL Request and Transaction Processing.
https://docs.teradata.com/reader/Fs1l1bqzqbnO0oVqjSVP5g/MMVfdnwiK91oXhkK53mN0Q
Teradata can bulk append data to populated tables using the TPT Update Operator or MultiLoad.
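For reference, a minimal sketch of what a PPI table looks like in Teradata (table and column names are hypothetical); as noted above, this helps range queries rather than the truncate-before-processing problem:

    CREATE TABLE landing.sap_orders (
        order_id   INTEGER NOT NULL,
        load_date  DATE    NOT NULL,
        payload    VARCHAR(4000)
    )
    PRIMARY INDEX (order_id)
    PARTITION BY RANGE_N(load_date BETWEEN DATE '2020-01-01'
                                   AND     DATE '2025-12-31'
                                   EACH INTERVAL '1' DAY);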

archiving the table : searching for the best way

There is a table which has 80,000 rows.
Every day I will clone this table to another log table, giving it a name like 20101129_TABLE, and every day the prefix will change according to the date.
As you can calculate, that will be 2,400,000 rows every month.
Advice please on saving space and getting fast service, and other advantages and disadvantages! How should I think about creating the best archive or log?
It is a table that has the accounts info: branch code, balance, etc.
It is quite tricky to answer your question since you are a bit vague on some important facts:
How often do you need the archived tables?
How free are you in your design-choices?
If you don't need the archived data often and you are free in your design, I'd copy the data into an archive database. That will give you the option of storing the database on a separate disk (cost-efficiency) and you can have a separate backup schedule on that database as well.
You could also store all the data in one table with just an additional column like ArchiveDate datetime. But I think this really depends on how you plan on accessing the data later.
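A minimal sketch of the single-table variant, with illustrative column names for the accounts data:

    CREATE TABLE AccountArchive (
        BranchCode  int           NOT NULL,
        AccountNo   int           NOT NULL,
        Balance     decimal(18,2) NOT NULL,
        ArchiveDate datetime      NOT NULL
    );

    -- Daily snapshot instead of cloning a new 20101129_TABLE each day
    INSERT INTO AccountArchive (BranchCode, AccountNo, Balance, ArchiveDate)
    SELECT BranchCode, AccountNo, Balance, GETDATE()
    FROM Accounts;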
Consider TABLE PARTITIONING (MSDN) - it is designed for exactly this kind of scenario. Not only can you spread data across partitions (and map partitions to different disks), you can also keep all the data in the same table and let MSSQL do all the hard work in the background (which partition to use based on select criteria, etc.).

Handling 100's of 1,000,000's of rows in T-SQL2005

I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding the count of rows from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes); you might see that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime, and providing functions to convert from datetime to int, thus having queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size saving will probably have a performance cost if you need to do complex datetime queries. Although on cubes there is the standard technique of storing, instead of an epoch, an int in YYYYMMDD format.
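One possible implementation of that conversion helper (the function name comes from the query above; the body here is just a sketch):

    CREATE FUNCTION dbo.dateTimeToEpoch (@dt datetime)
    RETURNS int
    AS
    BEGIN
        -- Seconds since 1970-01-01; note an int overflows after 2038-01-19
        RETURN DATEDIFF(second, '1970-01-01', @dt);
    END;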
What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store a similar set of items, you need one and only one table. Using multiple tables to store the same type of things will cause problems, like:
Queries would be extremely difficult to write, and performance will be decreased if you have to query from multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require you to automate a bunch of tasks. Let's say you have a table per year. If a new record is inserted at 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert whether you must create a new table? How would it affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run into performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes, which means you can store the entire table in roughly 4 GB of memory (21 bytes × 200,000,000 rows ≈ 4.2 GB). If you have a decent server (8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time while the cache is populated.