SSIS DataFlowTask DefaultBufferSize and DefaultBufferMaxRows - sql

I have a task which pulls records from an Oracle database into our SQL Server using a Data Flow Task. This package runs every day for around 45 minutes and refreshes about 15 tables. All of them except one are incremental updates, so most of those data flows run for 2 to 10 minutes.
The one that does a full replacement runs for up to 25 minutes, and I want to tune that data flow task to run faster.
There are only about 400k rows in the table. I have read some articles about DefaultBufferSize and DefaultBufferMaxRows, and I have the following doubts:
If I can set DefaultBufferSize up to 100 MB, is there any place to look, or any way to analyse, how much I can safely allocate?
DefaultBufferMaxRows is set to 10k. If I raise it to 50k but give DefaultBufferSize only 10 MB, which can hold only around 20k rows, what will SSIS do? Just ignore the other 30k records, or still pull all 50k records (spooling)?
Can I use the logging options to work out the proper limits?

As a general practice (and if you have enough memory), a smaller number of large buffers is better than a larger number of small buffers, but only up to the point where you start paging to disk (which is bad for obvious reasons).
To test it, you can log the event BufferSizeTuning, which will show you how many rows are in each buffer.
Also, before you begin adjusting the sizing of the buffers, the most important improvement that you can make is to reduce the size of each row of data by removing unneeded columns and by configuring data types appropriately.
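As a rough illustration of how the two settings interact (the row sizes here are assumed figures, not taken from your package): SSIS estimates the row size and then caps the rows per buffer at whichever limit it hits first. With an estimated row size of 500 bytes, DefaultBufferSize = 10 MB and DefaultBufferMaxRows = 50,000, a buffer can hold roughly 10 MB / 500 bytes ≈ 20,000 rows, which is less than 50,000, so SSIS simply puts about 20,000 rows in each buffer and uses more buffers to move all the data; it does not discard the remaining rows, and it does not spool to disk just because of that setting. With a 150-byte row, 10 MB / 150 bytes ≈ 69,000, so the DefaultBufferMaxRows value of 50,000 would be the effective cap instead.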

Related

100 requests/second with multiple indexes results in server load of 25 in oracle database

I have an application which uses an Oracle database on an 8-core machine with 16 GB of RAM. The table has 15 columns and around 5,700,000 rows, with indexes on 5 columns that are frequently updated. When we put a load of 100 requests/second on it, where each request performs an insert and then some read and update operations, the CPU load starts climbing steeply, reaches about 25, and after that I start getting the error:
I/O Error: Socket read timed out.
However, when we perform the same operations with an index on a single column, the load stays consistent at 4-5. With indexes on only 5 columns and a machine with an 8-core CPU and 16 GB of RAM, the load should not differ that much.
It would be helpful if you could show DDL for the creation of the table and the creation of your five indexes. It would also be helpful to understand what each indexed column represents, what the distribution of that data looks like, how unique are the values, how frequently a column is updated vs being inserted, etc. The accuracy of answers depends greatly on the clarity of your question. There are knobs you can turn on index creation that might help your performance issues, but there is not enough information here to offer any assistance.
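If it helps, one way to pull that DDL out of Oracle is DBMS_METADATA; this is only a sketch, and MY_SCHEMA/MY_TABLE are placeholders for your actual names:
-- Table DDL (placeholder names)
SELECT DBMS_METADATA.GET_DDL('TABLE', 'MY_TABLE', 'MY_SCHEMA') FROM DUAL;
-- DDL for every index on that table
SELECT DBMS_METADATA.GET_DDL('INDEX', index_name, owner)
FROM all_indexes
WHERE table_name = 'MY_TABLE' AND owner = 'MY_SCHEMA';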

Loading huge flatfiles to SQL table is too slow via SSIS package

I receive about 8 huge delimited flat files to be loaded into a SQL Server (2012) table once every week. The total number of rows across all the files is about 150 million, and each file has a different number of rows. I have a simple SSIS package which loads data from the flat files (using a Foreach Loop container) into a history table. Then a select query runs on this history table to pick out the current week's data and load it into a staging table.
We ran into problems as the history table grew very large (8 billion rows), so I decided to back up the data in the history table and truncate it. Before truncation the package execution time had crept from 15 hours up to 63 hours, and we hoped that after truncation it would go back to 15 hours or less. But to my surprise, even after 20+ hours the package is still running. The worst part is that it is still loading the history table; the latest count is around 120 million. It still has to load the staging data, and that might take just as long.
Neither the history table nor the staging table has any indexes, which is why the select query on the history table used to take most of the execution time. But loading from all the flat files into the history table was always under 3 hours.
I hope I'm making sense. Can someone help me understand what could be the reason behind this unusual execution time this week? Thanks.
Note: The biggest file (8 GB) was read at the flat file source in 3 minutes, so I don't think the source is the bottleneck here.
There's no good reason, IMHO, why that server should take that long to load that much data. Are you saying that the process which used to take 3 hours, now takes 60+? Is it the first (data-load) or the second (history-table) portion that has suddenly become slow? Or, both at once?
I think the first thing that I would do is to "trust, but verify" that there are no indexes at play here. The second thing I'd look at is the storage allocation for this tablespace ... is it running out of room, such that SQL Server has to do a bunch of extra calisthenics to obtain and maintain storage? How does this process COMMIT? After every row? Can you prove that the package definition has not changed in the slightest, recently?
Obviously, "150 million rows" is not a lot of data, these days; neither is 8GB. If you were "simply" moving those rows into an un-indexed table, "3 hours" would be a generous expectation. Obviously, the only credible root-cause of this kind of behavior is that the disk-I/O load has increased dramatically, and I am healthily suspicious that "excessive COMMITs" might well be part of the cause: re-writing instead of "lazy-writing," re-reading instead of caching.

SSIS crash after few records

I have an SSIS package which is supposed to take 100,000 records, loop over them, and for each one save the details to a few tables.
It works fine until it gets somewhere near the 3,000th record, and then Visual Studio crashes. At that point devenv.exe was using about 500 MB and only 3,000 rows had been processed.
I'm sure the problem is not with a specific record, because it always happens on a different 3K of records.
I have a decent computer with 2 GB of RAM available.
I'm using SSIS 2008.
Any idea what might be the issue?
Thanks.
Try increasing the default buffer size on your data flow tasks.
Example given here: http://www.mssqltips.com/sqlservertip/1867/sql-server-integration-services-ssis-performance-best-practices/
Best Practice #7 - DefaultBufferSize and DefaultBufferMaxRows
As I said in "Best Practices #6", the execution tree creates buffers for storing incoming rows and performing transformations. So how many buffers does it create? How many rows fit into a single buffer? How does it impact performance?
The number of buffers created depends on how many rows fit into a buffer, and how many rows fit into a buffer depends on a few other factors. The first consideration is the estimated row size, which is the sum of the maximum sizes of all the columns from the incoming records. The second consideration is the DefaultBufferSize property of the data flow task, which specifies the default maximum size of a buffer. The default value is 10 MB, and its upper and lower boundaries are constrained by two internal SSIS properties, MaxBufferSize (100 MB) and MinBufferSize (64 KB). This means a buffer can be as small as 64 KB and as large as 100 MB. The third factor is DefaultBufferMaxRows, again a property of the data flow task, which specifies the default number of rows in a buffer. Its default value is 10,000.
Although SSIS does a good job of tuning these properties to create an optimum number of buffers, if the estimated size exceeds DefaultBufferSize then it reduces the number of rows in the buffer. For better buffer performance you can do two things. First, you can remove unwanted columns from the source and set the data type of each column appropriately, especially if your source is a flat file. This will enable you to accommodate as many rows as possible in each buffer. Second, if your system has sufficient memory available, you can tune these properties to have a small number of large buffers, which can improve performance. Beware: if you change these values to the point where page spooling (see Best Practices #8) begins, it adversely impacts performance. So before you set values for these properties, test thoroughly in your environment and then set them appropriately.
You can enable logging of the BufferSizeTuning event to learn how many rows a buffer contains, and you can monitor the "Buffers spooled" performance counter to see whether SSIS has begun page spooling. I will talk more about event logging and performance counters in my next tips in this series.

Running Updates on a large, heavily used table

I have a large table (~170 million rows, 2 nvarchar and 7 int columns) in SQL Server 2005 that is constantly being inserted into. Everything works ok with it from a performance perspective, but every once in a while I have to update a set of rows in the table, which causes problems. It works fine if I update a small set of data, but if I have to update a set of 40,000 records or so it takes around 3 minutes and blocks the table, which causes problems since the inserts start failing.
If I just run a select to get back the data that needs to be updated I get back the 40k records in about 2 seconds. It's just the updates that take forever. This is reflected in the execution plan for the update, where the clustered index update takes up 90% of the cost and the index seek and top operator to get the rows take up 10% of the cost. The column I'm updating is not part of any index key, so it's not like it's reorganizing anything.
Does anyone have any ideas on how this could be sped up? My thought now is to write a service that will just see when these updates have to happen, pull back the records that have to be updated, and then loop through and update them one by one. This will satisfy my business needs but it's another module to maintain and I would love if I could fix this from just a DBA side of things.
Thanks for any thoughts!
Actually it might reorganise pages if you update the nvarchar columns.
Depending on what the update does to these columns they might cause the record to grow bigger than the space reserved for it before the update.
(See an explanation of how nvarchar is stored at http://www.databasejournal.com/features/mssql/physical-database-design-consideration.html.)
So say a record has a string of 20 characters saved in the nvarchar column - this takes 20*2+2 bytes of space (2 for the pointer). This is written at the initial insert into your table (based on the index structure). SQL Server will only use as much space as your nvarchar really takes.
Now comes the update and inserts a string of 40 characters. And oops, the space for the record within your leaf structure of your index is suddenly too small. So off goes the record to a different physical place with a pointer in the old place pointing to the actual place of the updated record.
This then causes your index to go stale and because the whole physical structure requires changing you see a lot of index work going on behind the scenes. Very likely causing an exclusive table lock escalation.
Not sure how best to deal with this. Personally if possible I take an exclusive table lock, drop the index, do the updates, reindex. Because your updates sometimes cause the index to go stale this might be the fastest option. However this requires a maintenance window.
You should batch your update into several smaller updates (say 10,000 rows at a time, TEST!) rather than one large one of 40k rows.
This way you will avoid a table lock. SQL Server will only take out about 5,000 locks (page or row) before escalating to a table lock, and even that is not entirely predictable (memory pressure etc.). Smaller updates made in this fashion will at least avoid the concurrency issues you are experiencing.
You can batch the updates using a service or firehose cursor.
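A minimal sketch of that batching approach, assuming hypothetical table and column names (dbo.BigTable, NeedsUpdate, SomeColumn) and a batch size chosen to stay under the lock-escalation threshold:
-- Placeholder names; 4000-row batches stay below the ~5000-lock escalation point
WHILE 1 = 1
BEGIN
    UPDATE TOP (4000) dbo.BigTable
    SET SomeColumn = 'NewValue',  -- the actual change you need to make
        NeedsUpdate = 0           -- clear the flag so each batch makes progress
    WHERE NeedsUpdate = 1;

    IF @@ROWCOUNT = 0 BREAK;      -- nothing left to update
END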
Read this for more info:
http://msdn.microsoft.com/en-us/library/ms184286.aspx
Hope this helps
Robert
The most brute-force (and simplest) way is to have a basic service, as you mentioned. That has the advantage of being able to scale with the load on the server and/or the data load.
For example, if you have a set of updates that must happen ASAP, then you could turn up the batch size. Conversely, for less important updates, you could have the update "server" slow down if each update is taking "too long" to relieve some of the pressure on the DB.
This sort of "heartbeat" process is rather common in systems and can be very powerful in the right situations.
It's odd that your analyzer says the time goes into updating the clustered index. Did the size of the data change when you updated? It seems like the nvarchar column is forcing the data to be reorganized, which may require updates to index pointers (as KMB has already pointed out). In that case you might want to increase the free space percentage (fill factor) on the data and index pages so that they can grow without relinking/reallocation. Since an update is an I/O-intensive operation (unlike a read, which can be served from the buffer cache), performance also depends on several factors:
1) Are your tables partitioned? 2) Does the entire table live on the same SAN disk (or is the SAN striped well)? 3) How verbose is the transaction logging? Can the transaction log buffer size be increased to support larger writes to the log for massive inserts?
It also matters which API/language you are using; for example, JDBC supports a batch update feature which makes multiple updates a bit more efficient.
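A minimal sketch of the fill-factor suggestion above, with placeholder table and index names; rebuilding with a lower fill factor leaves free space on each page so updated rows can grow in place:
-- Placeholder names; pick a fill factor that suits your update pattern
ALTER INDEX IX_MyTable_Clustered ON dbo.MyTable
REBUILD WITH (FILLFACTOR = 80);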

Fastest way to do mass update

Let’s say you have a table with about 5 million records and a nvarchar(max) column populated with large text data. You want to set this column to NULL if SomeOtherColumn = 1 in the fastest possible way.
The brute force UPDATE does not work very well here because it creates a large implicit transaction and takes forever.
Doing the updates in small batches of 50K records at a time works, but it still takes 47 hours to complete on a beefy 32-core/64 GB server.
Is there any way to do this update faster? Are there any magic query hints / table options that sacrifices something else (like concurrency) in exchange for speed?
NOTE: Creating temp table or temp column is not an option because this nvarchar(max) column involves lots of data and so consumes lots of space!
PS: Yes, SomeOtherColumn is already indexed.
From everything I can see it does not look like your problems are related to indexes.
The key seems to be in the fact that your nvarchar(max) field contains "lots" of data. Think about what SQL has to do in order to perform this update.
Since the column you are updating is likely more than 8000 characters it is stored off-page, which implies additional effort in reading this column when it is not NULL.
When you run a batch of 50000 updates SQL has to place this in an implicit transaction in order to make it possible to roll back in case of any problems. In order to roll back it has to store the original value of the column in the transaction log.
Assuming (for simplicity's sake) that each column contains on average 10,000 bytes of data, 50,000 rows will contain around 500 MB of data, which has to be stored temporarily (in simple recovery mode) or permanently (in full recovery mode).
There is no way to disable the logs as it will compromise the database integrity.
I ran a quick test on my dog slow desktop, and running batches of even 10,000 becomes prohibitively slow, but bringing the size down to 1000 rows, which implies a temporary log size of around 10MB, worked just nicely.
I loaded a table with 350,000 rows and marked 50,000 of them for update. This completed in around 4 minutes, and since it scales linearly, you should be able to update the entire 5 million rows in around 6 hours on my single-processor, 2 GB desktop, so I would expect something much better on your beefy server backed by a SAN or something.
You may want to run your update statement as a select, selecting only the primary key and the large nvarchar column, and ensure this runs as fast as you expect.
Of course the bottleneck may be other users locking things or contention on your storage or memory on the server, but since you did not mention other users I will assume you have the DB in single user mode for this.
As an optimization you should ensure that the transaction logs are on a different physical disk /disk group than the data to minimize seek times.
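A rough sketch of the small-batch approach described above, using the 1,000-row batch size that worked in the test; the table and column names other than SomeOtherColumn are placeholders:
-- Placeholder names; 1000-row batches keep the per-batch log volume small
WHILE 1 = 1
BEGIN
    UPDATE TOP (1000) dbo.BigTable
    SET BigText = NULL
    WHERE SomeOtherColumn = 1
      AND BigText IS NOT NULL;    -- ensures each batch touches new rows

    IF @@ROWCOUNT = 0 BREAK;

    CHECKPOINT;                   -- in simple recovery, lets the log space be reused between batches
END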
Hopefully you already dropped any indexes on the column you are setting to null, including full text indexes. As said before, turning off transactions and the log file temporarily would do the trick. Backing up your data will usually truncate your log files too.
You could set the database recovery mode to Simple to reduce logging, BUT do not do this without considering the full implications for a production environment.
What indexes are in place on the table? Given that batch updates of approx. 50,000 rows take so long, I would say you require an index.
Have you tried placing an index or statistics on someOtherColumn?
This really helped me. I went from 2 hours to 20 minutes with this.
/* I set the database recovery model to Simple */
/* Update table statistics */
set transaction isolation level read uncommitted
/* Your 50k update, just to get a measure of the time it will take */
set transaction isolation level READ COMMITTED
In my experience with MSSQL 2005, moving 4 million 46-byte records (no nvarchar(max), though) every day (automatically) from a table in one database to a table in a different database takes around 20 minutes on a quad-core, 8 GB, 2 GHz server, and it doesn't hurt application performance. By moving I mean INSERT INTO SELECT and then DELETE. CPU usage never goes over 30%, even when the table being deleted from has 28M records and there are constantly around 4K inserts per minute (but no updates). Well, that's my case; it may vary depending on your server load.
READ UNCOMMITTED
"Specifies that statements (your updates) can read rows that have been modified by other transactions but not yet committed." In my case, the records are readonly.
I don't know what rg-tsql means but here you'll find info about transaction isolation levels in MSSQL.
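For reference, the move described above boils down to this pattern (the database, table and column names are made up):
-- Placeholder names; copy the rows, then delete them from the source
INSERT INTO ArchiveDb.dbo.TargetTable (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM SourceDb.dbo.SourceTable
WHERE CreatedDate < DATEADD(DAY, -1, GETDATE());

DELETE FROM SourceDb.dbo.SourceTable
WHERE CreatedDate < DATEADD(DAY, -1, GETDATE());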
Try indexing 'SomeOtherColumn'...50K records should update in a snap. If there is already an index in place see if the index needs to be reorganized and that statistics have been collected for it.
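One way to check and refresh that index, with placeholder table and index names:
-- Placeholder names; check fragmentation, then reorganize and refresh statistics
SELECT index_id, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.BigTable'), NULL, NULL, 'LIMITED');

ALTER INDEX IX_BigTable_SomeOtherColumn ON dbo.BigTable REORGANIZE;
UPDATE STATISTICS dbo.BigTable IX_BigTable_SomeOtherColumn;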
If you are running a production environment with not enough space to duplicate all your tables, I believe that you are looking for trouble sooner or later.
If you provide some info about the number of rows with SomeOtherColumn = 1, perhaps we can think of another way, but I suggest:
0) Backup your table
1) Index the flag column
2) Set the table option to "no log transactions" ... if possible
3) write a stored procedure to run the updates