I have a copy database that has one table. It needs to be refreshed each night. There are four approaches:
1. Use a VB6 recordset or .NET datareader to loop through all the records in the table and make the appropriate changes.
2. Use SSIS to truncate the data in the table and then refresh it.
3. Use a checksum to establish which records have changed, then refresh that data using SSIS.
4. Use a checksum to establish which records have changed, then refresh that data using a recordset or datareader.
The problem with approach one is that it is too slow. It takes about two weeks to run as there are 90,000,000 records. Approach four is too slow as well. There are about 20,000 updates per day.
Therefore I believe it is between option two and option three. Option two only takes about fifteen minutes. However, users could be searching the table whilst it is being truncated and refreshed.
I am wondering if I can use transactions to isolate the work. However, if I use a serializable transaction whilst the data is being refreshed, the table is locked for fifteen minutes. Is there another option?
What about a stored procedure that gets kicked off with a SQL job? It would probably work similarly to #1 but be faster. Re-indexing before doing your update would probably help too.
Definitely option 3. Truncating 90 mil. rows to refresh 20,000 is (as I'm sure you know) very inefficient and will just fill up your transaction logs unnecessarily.
SSIS has full transaction handling support, so should easily be able to meet your needs.
Here is a good article on using transactions with SSIS: https://www.mssqltips.com/sqlservertip/1585/how-to-use-transactions-in-sql-server-integration-services-ssis/
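As a rough sketch of the checksum idea behind option 3, here is the compare-and-update step in Python, with the standard-library sqlite3 module standing in for SQL Server and SSIS. The table names source_rows and copy_rows, their columns, and both function names are made up for illustration. On a 90,000,000-row table you would compute the checksum inside the database (SQL Server has CHECKSUM and HASHBYTES) rather than pull rows to the client, and this sketch does not handle deleted rows.

```python
import hashlib
import sqlite3


def row_checksum(row):
    """Hash every column value of a row into one hex digest."""
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def refresh_changed(conn):
    """Copy only new or changed rows from source_rows into copy_rows.

    Both tables are hypothetical and share the columns (id, name, amount),
    with id as the primary key.
    """
    # Checksums of what the copy currently holds, keyed by primary key.
    existing = {
        row[0]: row_checksum(row)
        for row in conn.execute("SELECT id, name, amount FROM copy_rows")
    }
    # Keep only source rows that are missing from the copy or differ from it.
    changed = [
        row
        for row in conn.execute("SELECT id, name, amount FROM source_rows")
        if existing.get(row[0]) != row_checksum(row)
    ]
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO copy_rows (id, name, amount) VALUES (?, ?, ?)",
            changed,
        )
    return len(changed)
```

With roughly 20,000 changes a day, the write inside the transaction touches only those rows, so any locks are held for far less time than a full truncate and reload.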
I have created a table for reporting purposes that stores data in about 50 columns. At a set time interval my scheduler executes a service which processes other tables and fills my flat table with data.
Currently I am deleting and re-inserting the data in that table. I want to know whether this is good practice, or whether I should check every column in every row, update it if any change is found, and insert a new record if the data does not exist.
FYI, the total number of rows being reinserted is 100k+.
This is a very broad question that can only really be answered with access to your environment and discussion on your personal requirements. Obviously this is not possible via Stack Overflow.
This means you will need to make this decision yourself.
To do this, you need to understand the types of table updates available and how you can achieve them, normally referred to as Slowly Changing Dimensions. There are several different types, each with their own advantages, disadvantages and optimal use cases.
Once you understand the how of getting your data to update incrementally as required, you can then look at the why, and whether the extra processing logic required to achieve this is actually worth it. Your dataset of a few hundred thousand rows is not large and probably does not need this level of processing just yet, though that assessment will depend on how complex and time-consuming your current process is and how long you have to run it.
It is probably faster to repopulate the table of 100k rows. To do an update, you still need to:
generate all the rows to insert
compare values in every row
update the values that have changed
The expense of updating rows lies mostly in the logging and data movement operations at the data page level. In addition, you need to bring the data together.
If the update is updating a significant portion of rows, perhaps even just a few percent of them, then it is likely that all data pages will be modified. So the I/O is pretty similar.
When you simply replace the table, you will start by either dropping the table or truncating it. Those are relatively cheap operations because they are not logged at the row level. Then you are inserting into the table. Inserting 100,000 rows from one table to another should be pretty fast.
The above is general guidance. Of course, if you are only changing 3 rows in the table each day, then update is going to be faster. Or, if you are adding a new layer of data each day, then just an insert, with a handful of changed historical values might be a fine approach.
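For the full-replace route, a minimal sketch of doing the delete and the reload as one unit of work is below. It uses Python's standard-library sqlite3 as a stand-in database; the table report_rows, its columns and the function name are invented for the example. SQLite has no TRUNCATE, so a plain DELETE stands in for it here.

```python
import sqlite3


def replace_all(conn, rows):
    """Swap the full contents of the hypothetical report_rows table.

    The delete and the inserts run in a single transaction, so a failure
    part-way through leaves the previous contents in place.
    """
    with conn:  # commits on success, rolls back on any exception
        conn.execute("DELETE FROM report_rows")
        conn.executemany(
            "INSERT INTO report_rows (id, label) VALUES (?, ?)", rows
        )
```

If the reload fails half-way, the rollback restores the old rows, so the table is not left empty or half-filled.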
I am inserting large amounts of data into a table.
For example once every 15 minutes, N records of data become available to be inserted into the table.
My question is, what should I do if inserting N records takes more than 15 minutes? That is, the next insertion cannot begin because the previous one is still in progress.
Please assume that I've used the most affordable hardware and even dropping indexes before starting to insert data does not make inserting faster than 15 minutes.
My preference is not to drop indexes though, because at the same time, the table is queried. What's the best practice in such scenario?
P.S. I don't have any actual code. I am just thinking through a possible scenario.
If you are receiving/loading a large quantity of data every quarter hour, you have an operational requirement, not an application requirement, so use an operational solution.
Most databases have a "bulk insert" utility. SQL Server is no exception and even calls its statement BULK INSERT:
BULK INSERT mytable FROM 'my_data_file.dat'
Such utilities are built for raw speed and will outstrip any alternative application solution.
Write a shell script to receive the data into a file, formatting it as required using shell utilities, and invoke BULK INSERT.
Wire the process up to crontab (or, if you are running on Windows, the equivalent scheduler, such as Task Scheduler or the older AT command).
First thing is to look for basic optimizations for inserts.
You can find many posts about it:
What is the fastest way to insert large number of rows
Insert 2 million rows into SQL Server quickly
The second thing is to find out why it takes more than 15 minutes. Many things can explain that: locks, isolation level, etc. So try to challenge it (for example, can some of the queries read uncommitted records?).
The third thing is to find the right batch size for the insert, and to consider splitting the data into several smaller chunks with intermediate commits. Many inserts in one transaction without committing may have a bad effect on the server (in terms of log file and locks, since it must be able to roll back the entire transaction).
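The chunking with intermediate commits could look roughly like this. It is only a sketch in Python against the standard-library sqlite3 module; the readings table, the function name and the default chunk size of 1000 are assumptions for illustration, and the right chunk size has to be found by measuring on your own server.

```python
import sqlite3
from itertools import islice


def insert_in_chunks(conn, rows, chunk_size=1000):
    """Insert rows into the hypothetical readings table chunk by chunk.

    Each chunk is its own transaction, so the log and the locks only ever
    cover chunk_size rows at a time.
    """
    iterator = iter(rows)
    committed = 0
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        with conn:  # commit after every chunk
            conn.executemany(
                "INSERT INTO readings (id, value) VALUES (?, ?)", chunk
            )
        committed += len(chunk)
    return committed
```

The trade-off is that a failure leaves the earlier chunks committed, so the load needs to be restartable from the last committed chunk.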
I have a SQL table and a VB.NET application.
The application loads the SQL table into a DataTable, then updates the records by fetching data from some websites. It takes an average of 1.4 seconds to fill one DataTable row with new data.
I was wondering whether it is OK to issue a SQL UPDATE command for a single record in the SQL table every time a record is updated, which means running an UPDATE for a single record every 1.4 seconds.
The problem is that other applications use this table at the same time, and one of them writes to the same table but to other columns. Will the table be locked for the other applications during this process?
SQL Server won't lock the whole table by default, but you probably should lock the table while updating it to prevent data corruption if those apps are making alterations. Performance will take a small hit, yes, but better that than having to rebuild the table because it got messed up. This is a good explanation of locking:
http://www.developerfusion.com/article/84509/managing-database-locks-in-sql-server/
If the other applications are just querying the table while you're updating, there shouldn't be any impact, but they might get some odd results if they query it mid-update. Locking is mainly about the risk of two people modifying the same record at the same time.
You need to find out why it takes 1.4 seconds to update a single record. Chances are it's because VB.NET needs to do some processing (while it's fetching the websites). For example, it could be taking 1.3 seconds to perform the necessary calculations (client time) and 0.1 seconds to update a single record (server time). In that case, you could perform the updates in batches to minimize database access time.
The table will get locked, but only for a short time, so in general you don't need to worry about that.
I have half a million records in a data set of which 50,000 are updated. Now I need to commit the updated records back to the SQL Server 2005 Database.
What is the best and most efficient way to do this, considering that such updates could be frequent (concurrency is not an issue, but performance is)?
I would use a Batch Update.
Also documented here.
I agree with David's answer, as that's what I use. However, there is an alternative approach you could take which is worth considering (all situations are different after all) - it's something I would consider in the future if I had another similar requirement.
You could bulk insert the updated records into a new table in the DB using SqlBulkCopy, which is an extremely fast way of loading data into the DB. Then run an UPDATE statement on your main table to pull in the updated values from this new table, which you would drop at the end.
The batched update approach of using SqlDataAdapter allows you to easily deal with any errors on specific rows (e.g. you could tell it to continue in the event of an error with a specific updated row so it doesn't stop the whole process).
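To make the staging-table alternative concrete, here is a minimal sketch of its shape. It is written in Python with the standard-library sqlite3 module rather than .NET and SqlBulkCopy, so the bulk load is just executemany here; the items and staging tables, the price column and the function name are all hypothetical.

```python
import sqlite3


def apply_updates_via_staging(conn, updates):
    """Load (id, price) pairs into a temp table, then update items from it.

    The main table is touched by one set-based UPDATE instead of one
    statement per changed row.
    """
    with conn:  # commit on success, roll back on error
        conn.execute(
            "CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, price REAL)"
        )
        conn.executemany(
            "INSERT INTO staging (id, price) VALUES (?, ?)", updates
        )
        cur = conn.execute(
            """
            UPDATE items
               SET price = (SELECT s.price FROM staging AS s WHERE s.id = items.id)
             WHERE id IN (SELECT id FROM staging)
            """
        )
        changed = cur.rowcount  # rows in items that matched a staged id
        conn.execute("DROP TABLE staging")
    return changed
```

Staged ids that do not exist in the main table are simply ignored by the UPDATE, so the return value can be compared with the number of staged rows as a sanity check.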
I am trying to select hundreds of rows from a DB that contains hundreds of thousands of rows, and to update those rows afterwards.
The problem is that I don't want to go to the DB twice for this, since the update only marks those rows as "read".
Is there any way I can do this in Java using the plain JDBC libraries (hopefully without using stored procedures)?
Update: OK, here is some clarification.
There are a few instances of the same application running on different servers. They all need to select hundreds of "UNREAD" rows sorted by the creation_date column, read the blob data within them, write it to a file and FTP that file to some server. (I know, prehistoric, but requirements are requirements.)
The read and update part is there to ensure each instance gets a different set of data. (The rows are needed in order, so tricks like odds and evens won't work :/)
We select the data for update, the data transfers over the wire (we wait and wait), and then we update the rows as "READ" and release the lock. This entire thing takes too long. By reading and updating at the same time, I would like to reduce the lock time (from the SELECT FOR UPDATE to the actual update) so that using multiple instances would increase the number of rows read per second.
Still have ideas?
It seems to me there might be more than one way to interpret the question here.
1. You are selecting the rows for the sole purpose of updating them and not reading them.
2. You are selecting the rows to show to somebody, and marking them as read either one at a time or all as a group.
3. You want to select the rows and mark them as read at the time you select them.
Let's take Option 1 first, as that seems to be the easiest. You don't need to select the rows in order to update them, just issue an update with a WHERE clause:
update table_x
set read = 'T'
where date > sysdate-1;
Looking at option 2, you want to mark them as read when a user has read them (or a downstream system has received them, or whatever). For this, you'll probably have to do another update. If you query for the primary key in addition to the other columns you need in the first select, you will probably have an easier time updating, as the DB won't have to do table or index scans to find the rows.
In JDBC (Java) there is a facility to do a batch update, where you execute a set of updates all at once. That's worked out well when I need to perform a lot of updates that are of the exact same form.
Option 3, where you want to select and update all in one shot. I don't find much use for this, personally, but that doesn't mean others don't. I suppose some kind of stored procedure would reduce the round trips. I'm not sure what db you are working with here and can't really offer specifics.
Going to the DB isn't so bad. If you aren't returning anything 'across the wire' then an update shouldn't do you too much damage, and it's only a few hundred thousand rows. What is your worry?
If you're doing a SELECT in JDBC and iterating over the ResultSet to UPDATE each row, you're doing it wrong. That's an (n+1) query problem that will never perform well.
Just do an UPDATE with a WHERE clause that determines which of those rows needs to be updated. It's a single network round trip that way.
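One way to apply the single-UPDATE idea to the several-instances case is to let the UPDATE itself pick the oldest unread rows and stamp them with a per-call token, then read back only the rows carrying that token. The sketch below is Python against the standard-library sqlite3 module, not JDBC. The messages table, its claimed_by column and the function name are assumptions added for illustration, not something from the question. Whether the UPDATE is safe against concurrent instances depends on your database and isolation level, so treat this as a starting point to test rather than a guarantee.

```python
import sqlite3
import uuid


def claim_unread(conn, batch_size):
    """Mark the oldest unread rows as READ under a unique token, then fetch them.

    The lock is only needed for the UPDATE itself; the blobs are read
    afterwards, outside the write transaction.
    """
    token = uuid.uuid4().hex  # identifies the rows claimed by this call
    with conn:  # the claim is committed before any payload is transferred
        conn.execute(
            """
            UPDATE messages
               SET status = 'READ', claimed_by = ?
             WHERE id IN (SELECT id FROM messages
                           WHERE status = 'UNREAD'
                           ORDER BY creation_date
                           LIMIT ?)
            """,
            (token, batch_size),
        )
    return conn.execute(
        "SELECT id, payload FROM messages WHERE claimed_by = ? ORDER BY creation_date",
        (token,),
    ).fetchall()
```

This is still two statements, but the long part (moving the blob data over the wire) no longer happens while the rows are locked for update.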
Don't be too code-centric. Let the database do the job it was designed for.
Can't you just use the same connection without closing it?