I am building a database which contains about 30 tables:
The largest number of columns in any table is about 15.
For datatypes I am mostly using VarChar(50) for text
and Int or SmallInt for numbers.
Identity columns are uniqueidentifiers.
I have been testing a bit, filling in data and deleting it
again. I have now deleted all data, so every table is empty.
But if I look at the properties of the database in
Management Studio, the size says 221.38 MB!
How come? Please help, I am getting notifications
from my hosting company that I am exceeding my limits.
Best regards,
:-)
I would suggest that you look first at the recovery model of the database. By default, the recovery model is FULL. This fills the log file with every transaction you perform and never removes them until you take a log backup.
To change the recovery model, right click on the database and choose Properties. In the Properties dialog, select the Options page, then change the "Recovery model" to Simple.
You probably also want to shrink your files. To do this, right click on the database and choose Tasks --> Shrink --> Files. You can shrink both the data file and the log file, by changing the "File Type" option in the middle.
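If you prefer to script those two steps instead of using the GUI, a minimal sketch would be the following (assuming your database is called MyDatabase and its log file's logical name is MyDatabase_log; substitute your own names, which you can look up with SELECT name, type_desc FROM sys.database_files):

ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;   -- switch to the Simple recovery model
DBCC SHRINKFILE (MyDatabase_log, 1);             -- shrink the log file down to about 1 MB
DBCC SHRINKFILE (MyDatabase, 10);                -- optionally shrink the data file as well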
Martin's comment is quite interesting. Even if the log file is in auto-truncate mode, you still have the issue of deletes being logged. If you created large-ish tables, the log file will still expand and the space will not be recovered until you shrink the file. You can get around this by using TRUNCATE rather than DELETE:
truncate table <table>
does not log every record being deleted (http://msdn.microsoft.com/en-us/library/ms177570.aspx).
delete from <table>
logs every record.
As you do inserts, updates, deletes, and design changes, a log file containing every transaction, along with a whole bunch of other data, is created. This transaction log is a required component of a SQL Server database and thus cannot be disabled through any available settings.
Below is an article from Microsoft on doing backups to shrink the transaction logs generated by SQL Server.
http://msdn.microsoft.com/en-us/library/ms178037(v=sql.105).aspx
Also, are you indexing your columns? Indexes that consist of several columns on tables with a high row count can become unnecessarily large, especially if you are just doing tests. Try just having a single clustered index on only one column per table.
You may also want to learn about table statistics. They help the query optimizer make good use of your indexes and also help queries like SELECT DISTINCT or SELECT COUNT(*) run efficiently.
http://msdn.microsoft.com/en-us/library/ms190397.aspx
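As a quick illustration of those last two suggestions (the table and column names here are purely hypothetical):

CREATE CLUSTERED INDEX IX_Customers_Id ON dbo.Customers (CustomerId);   -- a single, narrow clustered index
UPDATE STATISTICS dbo.Customers;   -- refresh statistics for one table
EXEC sp_updatestats;               -- or refresh statistics for the whole database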
Finally, you will need to upgrade your storage allocation for the SQL Server database. The more you use it, the faster it will want to grow.
Requirement:
Transfer millions of records from source (SQL Server) to destination (SQL Server).
Structure of source tables is different from destination tables.
Refresh data once per week in destination server.
Minimum amount of processing time.
I am looking for an optimized approach using SSIS.
I was thinking of these options:
Create a SQL dump from the source server and import that dump on the destination server.
Directly copy the tables from the source server to the destination server.
There are lots of issues to consider here, such as whether the servers are in the same domain, on the same network, etc.
Most of the time you will not want to move the data as a single large chunk of millions of records, but in smaller batches. An SSIS package handles that logic for you, but you can always build the equivalent yourself, which makes iterating on changes easier. Sometimes this is a reason to push changes more often rather than wait an entire week, as smaller syncs are easier to manage and cause less downtime.
Another consideration is to be sure you understand your deltas and to ensure that you capture ALL of the changes. For this reason I would generally suggest using a staging table at the destination server. By moving changes to staging and then loading into the final table you can more easily ensure that changes are applied correctly. Think of scenarios such as an identity increment arriving out of order (identity insert), datetimes ordered incorrectly, or one chunk failing. When using a staging table you don't have to rely solely on the id/date and can actually join on primary keys to look for changes.
Linked Servers, as proposed by Alex K., can be a great fit, but you will need to pay close attention to a couple of things. Always run the transfer from the destination server so that it is a PULL, not a push. Linked servers are fast at querying data but horrible at updating/inserting in bulk. A table with an xml column cannot be used over a linked server at all. You may also need to set some specific properties for distributed transactions.
I have done this task both ways, and I would say that SSIS has a bit of an advantage over a linked server because of its robust error handling, threading logic, and ability to use different adapters (OLE DB, ODBC, etc.; they have different performance characteristics, so do a search and you will find some results). But the key to your #4 is to do it in smaller chunks from a staging table, and if you can do it more often it is less likely to have an impact. E.g., a daily sync is already about 1/7th the size of a weekly one, assuming an even distribution of changes across days.
Take 10,000,000 records changed per week:
Once weekly = 10 million
Once daily = ~1.4 million
Once hourly = ~59K records
Once every 5 minutes = less than 5K records
And if it has to be once a week, still think about doing it in small chunks so that each insert has a minimal effect on your transaction logs, actual lock time on the production table, etc. Be sure that you never allow partially staged/transferred data to be loaded, otherwise identifying deltas can get messed up and you could end up missing changes.
One other thought, if this is a scenario like a reporting instance and you have enough server resources: you could bring your entire table over from production into a staging copy at the destination, then simply drop the current table and rename the staging table (see the sketch below). This is an extreme scenario and not one I generally like, but it is possible, and the actual impact to the user would be very nominal.
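A minimal sketch of that drop-and-rename swap, assuming hypothetical tables dbo.Report (current) and dbo.Report_Staging (freshly loaded):

BEGIN TRANSACTION;
    DROP TABLE dbo.Report;                          -- remove the current copy
    EXEC sp_rename 'dbo.Report_Staging', 'Report';  -- promote the staging table in its place
COMMIT TRANSACTION;

Wrapping both statements in one transaction keeps the window in which neither table exists as short as possible.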
I think SSIS is good at transferring data; my approach here:
1. Create a package with one Data Flow Task to transfer the data. If the structure of the two tables is different, that's okay, just map the columns.
2. Create a SQL Server Agent job to run your package every weekend.
Also, the Track Data Changes (SQL Server) feature is worth a look. You can configure when you want to sync the data, and it performs well too.
With SQL Server versions >2005, it has been my experience that dumping to a file and re-importing it is equal to or slower than transferring the data directly from table to table with SSIS.
That said, and in addition to the excellent points @Matt makes, this is the usual pattern I follow for this sort of transfer.
Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
I typically put these into their own database schema so their purpose is clear.
I also typically use the SSIS OLE DB Destination package's "New" button to create the tables.
Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
This can be one package or many, depending on how many tables you're pulling over.
Create stored procedures in your destination database to transform the data into the shape it needs to be in for the final tables (a sketch follows after these steps).
Using SSIS data transformations is, almost without exception, less efficient than server-side SQL processing.
Use SSIS Execute SQL tasks to call the stored procedures.
Use parallel processing via Sequence Containers where possible to save time.
This can be one package or many, depending on how many tables you're transforming.
(Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
You will have to decide whether you want to use the stored procedures to land the data in your ultimate destination tables, or if you want to have the procedures write to intermediate tables, and then move the transformed data directly into the final tables. Using intermediate tables minimizes down time on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).
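To make the transformation step concrete, here is a minimal sketch of what one of those stored procedures might look like; the replica schema, destination table, and column names are purely hypothetical:

CREATE PROCEDURE dbo.LoadCustomers
AS
BEGIN
    SET NOCOUNT ON;

    -- reshape the replicated source rows into the destination table's structure
    INSERT INTO dbo.Customers (CustomerId, FullName, Address1)
    SELECT s.id,
           s.title + ' ' + s.Initials + ' ' + s.[Last Name],
           s.[Address1]
    FROM replica.TABLENAME1 AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Customers AS c WHERE c.CustomerId = s.id);
END;

Each Execute SQL Task in the transformation package(s) then simply runs EXEC dbo.LoadCustomers;.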
I have a task in a project that requires the results of a process, which could be anywhere from 1,000 up to 10,000,000 records (approximate upper limit), to be inserted into a table with the same structure in another database across a linked server. The requirement is to be able to transfer in chunks to avoid any timeouts.
In doing some testing I set up a linked server and, using the following code, transferred approximately 18,000 records:
DECLARE @BatchSize INT = 1000;

WHILE 1 = 1
BEGIN
    INSERT INTO [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2] WITH (TABLOCK)
    (
        id
        ,title
        ,Initials
        ,[Last Name]
        ,[Address1]
    )
    SELECT TOP (@BatchSize)
        s.id
        ,s.title
        ,s.Initials
        ,s.[Last Name]
        ,s.[Address1]
    FROM [DBNAME1].[dbo].[TABLENAME1] s
    WHERE NOT EXISTS (
        SELECT 1
        FROM [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2]
        WHERE id = s.id
    );

    IF @@ROWCOUNT < @BatchSize BREAK;
END
This works fine; however, it took 5 minutes to transfer the data.
I would like to implement this using SSIS and am looking for any advice in how to do this and speed up the process.
Open Visual Studio/Business Intelligence Designer Studio (BIDS)/SQL Server Data Tools-BI edition(SSDT)
Under the Templates tab, select Business Intelligence, Integration Services Project. Give it a valid name and click OK.
In Package.dtsx which will open by default, in the Connection Managers section, right click - "New OLE DB Connection". In the Configure OLE DB Connection Manager section, Click "New..." and then select your server and database for your source data. Click OK, OK.
Repeat the above process but use this for your destination server (linked server).
Rename the above connection managers from server\instance.databasename to something better. If databasename does not change across the environments then just use the database name. Otherwise, go with the common name of it. i.e. if it's SLSDEVDB -> SLSTESTDB -> SLSPRODDB as you migrate through your environments, make it SLSDB. Otherwise, you end up with people talking about the connection manager whose name is "sales dev database" but it's actually pointing at production.
Add a Data Flow to your package. Call it something useful besides Data Flow Task. DFT Load Table2 would be my preference but your mileage may vary.
Double click the data flow task. Here you will add an OLE DB Source, a Lookup Transformation and an OLE DB Destination. Probably, as always, it will depend.
OLE DB Source - use the first connection manager we defined and a query
SELECT
s.id
,s.title
,s.Initials
,s.[Last Name]
,s.[Address1]
FROM [dbo].[TABLENAME1] s
Only pull in the columns you need. Your query currently filters out any duplicates that already exist in the destination. Doing that can be challenging. Instead, we'll bring the entirety of TABLENAME1 into the pipeline and filter out what we don't need. For very large volumes in your source table, this may be an untenable approach and we'd need to do something different.
From the Source we need to use a Lookup Transformation. This will allow us to detect the duplicates. Use the second connection manager we defined, one that points to the destination. Change the NoMatch from "Fail Component" to "Redirect Unmatched rows" (name approximate)
Use your query to pull back the key value(s)
SELECT T2.id
FROM [dbo].[TABLENAME2] AS T2;
Map T2.id to the id column.
When the package starts, it will issue the above query against the target table and cache all the values of T2.id into memory. Since this is only a single column, that shouldn't be too expensive but again, for very large tables, this approach may not work.
There are 3 outputs now available from the Lookup: Match, NoMatch and Error. Match will be anything that exists in both the source and the destination. You don't care about those, as you are only interested in what exists in the source and not the destination. You would only care if you had to determine whether values have changed between source and destination. NoMatch are the rows that exist in the source but don't exist in the destination. That's the stream you want. For completeness, Error would capture things that went very wrong, but I've not experienced it "in the wild" with a lookup.
Connect the NoMatch stream to the OLE DB Destination. Select your Table Name there and ensure the words Fast Load are in the destination. Click on the Columns tab and make sure everything is routed up.
Whether you need to fiddle with the knobs on the OLE DB Destination is highly variable. I would test it, especially with your larger sets of data and see whether the timeout conditions are a factor.
Design considerations for larger sets
It depends.
Really, it does. But, I would look at identifying where the pain point lies.
If my source table is very large and pulling all that data into the pipeline just to filter it back out is the problem, then I'd look at something like a Data Flow to first bring all the rows from my Lookup over to the source database (use the T2 query), write them into a staging table, and make that one column your clustered key. Then modify your source query to reference your staging table.
Depending on how active the destination table is (whether any other process could load it), I might keep that lookup in the data flow to ensure I don't load duplicates. If this process is the only one that loads it, then drop the Lookup.
If the lookup is at fault - it can't pull in all the IDs - then either go with the first alternative listed above or look at changing your caching mode from Full to Partial. Do realize that this will issue a query to the target system for potentially every row that comes out of the source database.
If the destination is giving issues, I'd determine what the issue is. If it's network latency for loading the data, drop the value of Maximum insert commit size from 2147483647 to something reasonable, like your batch size from above (although 1k might be a bit low). If you're still encountering blocking, then staging the data to a different table on the remote server and then doing the insert locally might be an approach (sketched below).
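A minimal sketch of that last option, assuming a hypothetical staging table dbo.TABLENAME2_Staging on the destination server that the data flow loads instead of the final table; an Execute SQL Task (or a job on that server) then finishes the load locally:

-- runs on the destination server, so the final insert never crosses the network
INSERT INTO dbo.TABLENAME2 (id, title, Initials, [Last Name], [Address1])
SELECT s.id, s.title, s.Initials, s.[Last Name], s.[Address1]
FROM dbo.TABLENAME2_Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.TABLENAME2 WHERE id = s.id);

TRUNCATE TABLE dbo.TABLENAME2_Staging;   -- clear staging for the next run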
We have a 6-step process where we copy tables from one database to another. Each step executes a stored procedure.
1. Remove tables from the destination database
2. Create tables in the destination database
3. Shrink the database log before the copy
4. Copy tables from source to destination
5. Shrink the database log
6. Back up the destination database
During step 4, our transaction log (.ldf file) grows very large, to the point where we now have to consistently increase the max size on the SQL server, and soon enough (in the far future) we believe it may eat up all the resources on our server. It was suggested that in our script we commit each transaction instead of waiting until the end to commit all the transactions.
Any suggestions?
I'll make the assumption that you are moving large amounts of data. The typical solution to this problem is to break the copy up into smaller batches of rows. This keeps the hit on the transaction log smaller. I think this will be the preferred answer.
The other answer that I have seen is using Bulk Copy, which writes the data out to a text file and imports it into your target db using Bulk Copy. I've seen a lot of posts that recommend this. I haven't tried it.
If the schema of the target tables isn't changing could you not just truncate the data in the target tables instead of dropping and recreating?
Can you change the database recovery model to Bulk Logged for this process?
Then, instead of creating empty tables at the destination, do a SELECT INTO to create them. Once they are built, alter the tables to add indices and constraints. Doing bulk copies like this will greatly reduce your logging requirements.
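A rough sketch of that approach (the database, table, and index names are placeholders, not taken from your environment):

ALTER DATABASE DestinationDB SET RECOVERY BULK_LOGGED;   -- allow minimally logged operations

SELECT *
INTO DestinationDB.dbo.Customers        -- SELECT INTO is minimally logged under BULK_LOGGED
FROM SourceDB.dbo.Customers;

CREATE CLUSTERED INDEX IX_Customers_Id
    ON DestinationDB.dbo.Customers (CustomerId);   -- add indexes/constraints after the load

ALTER DATABASE DestinationDB SET RECOVERY FULL;   -- switch back when the copy is done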
I am using SQL Express 2005 and do a backup of all DBs every night. I noticed one DB getting larger and larger. I looked at the DB and cannot see why it's getting so big! I was wondering if it's something to do with the log file?
Looking for tips on how to find out why it's getting so big when it doesn't have that much data in it - also how to optimise / reduce the size?
Several things to check:
is your database in "Simple" recovery mode? If so, it'll produce far fewer transaction log entries, and the backup will be smaller. Recommended for development - but not for production
if it's in "FULL" recovery mode - do you take regular transaction log backups? That should limit the growth of the transaction log and thus reduce the overall backup size
have you run a DBCC SHRINKDATABASE(yourdatabasename) on it lately? That may help
do you have any log / logging tables in your database that are just filling up over time? Can you remove some of those entries?
You can find the database's recovery model by going to the Object Explorer, right click on your database, select "Properties", and then select the "Options" tab on the dialog:
Marc
If it is the backup that keeps growing and growing, I had the same problem. It is not a 'problem' of course; this is happening by design - you are just making a backup 'set' that will simply expand until all available space is taken.
To avoid this, you've got to change the overwrite options. In SQL Server Management Studio, right-click your DB, Tasks - Back Up, then in the backup window you'll see it defaults to the 'General' page. Change this to 'Options' and you'll get a different set of choices.
The default option at the top is 'Append to the existing media set'. This is what makes your backup increase in size indefinitely. Change this to 'Overwrite all existing backup sets' and the backup will always be only as big as one entire backup, the latest one.
(If you have a SQL script doing this, turn 'NOINIT' to 'INIT')
CAUTION: This means the backup will only be the latest changes - if you made a mistake three days ago but you only have last night's backup, you're stuffed. Only use this method if you have a backup regime that copies your .bak file daily to another location, so you can go back to any one of those files from previous days.
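For reference, a scripted equivalent of that choice might look like this (the path and database name are just placeholders):

BACKUP DATABASE MyDatabase
    TO DISK = 'D:\Backups\MyDatabase.bak'
    WITH INIT;   -- INIT overwrites the existing backup set instead of appending to it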
It sounds like you are running with the FULL recovery model and the Transaction Log is growing continuously as the result of no Transaction Log backups being taken.
In order to rectify this you need to:
1. Take a transaction log backup (see: BACKUP (Transact-SQL)).
2. Shrink the transaction log file down to an appropriate size for your needs (see: How to use DBCC SHRINKFILE).
3. Schedule regular transaction log backups according to your data recovery requirements.
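A minimal scripted version of steps 1 and 2, assuming the database is called MyDatabase and the log file's logical name is MyDatabase_log:

BACKUP LOG MyDatabase TO DISK = 'D:\Backups\MyDatabase_log.trn';   -- step 1: back up the transaction log
DBCC SHRINKFILE (MyDatabase_log, 512);                             -- step 2: shrink the log file to ~512 MB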
I suggest reading the following Microsoft reference in order to ensure that you are managing your database environment appropriately.
Recovery Models and Transaction Log Management
Further Reading: How to stop the transaction log of a SQL Server database from growing unexpectedly
One tip for keeping databases small would be, at design time, to use the smallest data type that you can.
For example, you may have a status table; do you really need the index column to be an int when a smallint or tinyint will do?
Darknight
As you take a daily FULL backup of your database, of course it will get big over time.
So you have to put a plan in place for yourself, such as:
1st day: FULL
2nd day: DIFFERENTIAL
3rd day: DIFFERENTIAL
4th day: DIFFERENTIAL
5th day: DIFFERENTIAL
and then start over.
And when you restore your database: if you want to restore the FULL backup you can do it easily, but when you need to restore a DIFF version, you first restore the FULL backup before it WITH NORECOVERY, then the DIFF you need, and then you will have your data back safely.
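As a sketch of that restore sequence (file names and paths are placeholders):

RESTORE DATABASE MyDatabase FROM DISK = 'D:\Backups\MyDatabase_full.bak' WITH NORECOVERY;   -- the full backup first, left non-operational
RESTORE DATABASE MyDatabase FROM DISK = 'D:\Backups\MyDatabase_diff.bak' WITH RECOVERY;     -- then the differential you need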
7zip your backup file for archiving. I recently backed up a database to a 178 MB .bak file. After archiving it to a .7z file it was only 16 MB.
http://www.7-zip.org/
If you need an archive tool that works with larger file sizes more efficiently and faster than 7zip does, I'd recommend taking a look at LZ4 archiving. I have used it for archiving file backups for years with no issues:
http://lz4.github.io/lz4/
I recently performed a purge on one of my application tables: 1.1 million records in total, with 11.12 GB of disk space used.
I deleted 860k records, leaving 290k records, but why did my space used only drop to 11.09 GB?
I monitor the detail in the report: Disk Usage - Disk Space Used by Data Files - Space Used.
Do I need to perform a shrink on the data file? This has puzzled me for a long time.
For MS SQL Server, rebuild the clustered indexes.
You have only deleted rows: not reclaimed space.
DBCC DBREINDEX or ALTER INDEX ... REBUILD, depending on version
(It's MS SQL because the disk space report is in SSMS)
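For example, with a hypothetical table name:

ALTER INDEX ALL ON dbo.MyApplicationTable REBUILD;   -- SQL Server 2005 and later
DBCC DBREINDEX ('dbo.MyApplicationTable');           -- older versions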
You need to explicitly call some operation (specific to your database management system) that will shrink the data file. The database engine doesn't shrink the file when you delete records, that's for optimization purposes - shrinking is time-consuming.
I think this is like with mail folders in Thunderbird: If you delete something, it's just marked as deleted, but to get higher performance, the space isn't freed. So most of your 11.09 GB will now contain either your old data or 0's. Shrink data file will "compress" (or "clean") this by creating a new file that'll only contain the actual data that is left.
Probably you need to shrink the table. I know that SQL Server doesn't do it by default for you; I would guess this is for performance reasons, and maybe other DBs are the same.