Merge data using Integration Service - sql

Please Consider this scenario:
I have a table in my database. I want move this data in my OLAP database using SSIS.I can move all record from my table to OLAP database.The problem is I don't know how I can apply changes in OLAP environment.For example if just 100 record of my table were changed how I can apply this changes NOT copy all records from scratch.
How I can Merge this two tables?
thanks

There are two main approaches to this:
Lookup Transformation --> OLE DB Command / OLE DB Destination
Load all data to a staging table and perform the MERGE using SQL.
My Preference is for the latter because the update is SET Based, but I do use the former where I know it will be predominantly inserts.
With the former you will end up with a data flow task something like:
This is a OLE DB Source from the OLTP database, which then looks up against your OLAP Database to retrieve the surrogate key. Where there is no match it simple inserts a new record to the OLE DB Destination, when there is a match it does a conditional split, if any fields have changed it will use the OLE DB Command to update the OLAP table.
It can obviously get much more complicated than this, but this covers the simplest example.
You can also use the Slowly Changing Dimension Transformation to open up a wizard to create your data flow for you, which again gets a bit more complex:
As mentioned though, my Preference is for a staging table and a set based update, because the OLE DB Command executes on a row by row basis, so if you are updating millions of records this will take a long time. You can simply create a staging table on your OLAP database and move the data in with a simple OLE DB Source and Destination, then use MERGE to update the OLAP Table:
MERGE OLAP o
USING Staging s
ON o.BusinessKey = s.BusinessKey
AND o.Type2SCD = s.Type2SCD
AND o.Active = 1
WHEN MATCHED AND o.Type1SCD != s.Type1SCD THEN
UPDATE
SET Type1SCD = s.Type1SCD
WHEN NOT MATCHED BY TARGET THEN
INSERT (BusinessKey, Type1SCD, Type2SCD, Active, EffectiveDate)
VALUES (s.BusinessKey, s.Type1SCD, s.Type2SCD, 1, GETDATE())
WHEN NOT MATCHED BY SOURCE AND o.Active = 1 THEN
UPDATE
SET Active = 0;
The above assumes you have one active record per business Key, and both type 1 and type 2 slowly changing dimentions, it will insert a new record where there is no match on BusinessKey and Type2SCD, in addition it will set any unmatched records in the source table to inactive. When there is a match but the type 1 SCD is different this will be updated.
It is worth noting that MERGE has it's downsides, and you may want to write your set based upserts as separate INSERT and UPDATE statements. One major issue I have come across is that on all my Dimension tables I have a unique filtered index on my BusinessKey field WHERE Active = 1 to ensure there is only one active record, which the MERGE I have written should work fine for, but doesn't as detailed in this connect item. Although it was not the end of the world having to add OPTION (QUERYTRACEON 8790); to the end of all the MERGE statements in my ETL it was not ideal.

Sounds like you're wanting to use incremental loads.
The first five tutorials on this page should point you in the right direction - I found them really useful in the past.

Related

How can I create a Primary Key in Oracle when composite key and a "generated always as identity" option won't work?

I'm working on an SSIS project that pulls data form Excel and loads to Oracle Database every month. I plan to pull data from Excel file and load to Oracle stage table. I will be using a merge statement because the data that gets loaded each month is a rolling 12 month list and the data can change, so need to be able to INSERT when records don't match or UPDATE when they do. My control flow looks like this: Truncate Stage Table (to clear out table from last package run)---> DATA FLOW from Excel to Stage Table---> Merge to Target Table in Oracle.
My problem is that the data in the source Excel file doesn't have any unique columns to select a primary key or a composite key, as it is a possibility (although very unlikely) that a new record could have the exact same information. I am unable to utilize the "generated always as identity" because my SSIS package needs to truncate at the beginning of each job to clear out the Stage Table. This would generate the same ID numbers in the new load and create problems in the Target Table.
Any suggestions as to how I can get around this problem?
Welcome to SO and ETL. Instead of using a staging table, in SSIS use two sources: Excel file and existing production table. Sort both inputs and then perform a merge join on the unique identifier. From there, use a derived column transformation to add a new column called 'Action' which will mark a row as either an INSERT/UPDATE/DELETE based on whether the join key is NULL. So:
NULL from file means DELETE (not in file, in database)
NULL from database means INSERT (in file, not in database)
Not NULL for both means UPDATE (in file, in database)
From there, use a conditional split to split rows to either a OLE DB Destination (INSERT), or SQL Command (UPDATE or DELETE). You can now remove the stage environment and MERGE command from your process. This has the added benefit of removing the ETL load from the SQL Server, assuming SSIS is running on a separate server.
Note: The sort transformation has the option to remove duplicates.

exporting a table containing a large number of records to to another database on a linked server

I have a task in a project that required the results of a process, which could be anywhere from 1000 up to 10,000,000 records (approx upper limit), to be inserted into a table with the same structure in another database across a linked server. The requirement is to be able to transfer in chunks to avoid any timeouts
In doing some testing I set up a linked server and using the following code to test transfered approx 18000 records:
DECLARE #BatchSize INT = 1000
WHILE 1 = 1
BEGIN
INSERT INTO [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2] WITH (TABLOCK)
(
id
,title
,Initials
,[Last Name]
,[Address1]
)
SELECT TOP(#BatchSize)
s.id
,s.title
,s.Initials
,s.[Last Name]
,s.[Address1]
FROM [DBNAME1].[dbo].[TABLENAME1] s
WHERE NOT EXISTS (
SELECT 1
FROM [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2]
WHERE id = s.id
)
IF ##ROWCOUNT < #BatchSize BREAK
This works fine however it took 5 mins to transfer the data.
I would like to implement this using SSIS and am looking for any advice in how to do this and speed up the process.
Open Visual Studio/Business Intelligence Designer Studio (BIDS)/SQL Server Data Tools-BI edition(SSDT)
Under the Templates tab, select Business Intelligence, Integration Services Project. Give it a valid name and click OK.
In Package.dtsx which will open by default, in the Connection Managers section, right click - "New OLE DB Connection". In the Configure OLE DB Connection Manager section, Click "New..." and then select your server and database for your source data. Click OK, OK.
Repeat the above process but use this for your destination server (linked server).
Rename the above connection managers from server\instance.databasename to something better. If databasename does not change across the environments then just use the database name. Otherwise, go with the common name of it. i.e. if it's SLSDEVDB -> SLSTESTDB -> SLSPRODDB as you migrate through your environments, make it SLSDB. Otherwise, you end up with people talking about the connection manager whose name is "sales dev database" but it's actually pointing at production.
Add a Data Flow to your package. Call it something useful besides Data Flow Task. DFT Load Table2 would be my preference but your mileage may vary.
Double click the data flow task. Here you will add an OLE DB Source, a Lookup Task and a OLE DB Destination. Probably, as always, it will depend.
OLE DB Source - use the first connection manager we defined and a query
SELECT
s.id
,s.title
,s.Initials
,s.[Last Name]
,s.[Address1]
FROM [dbo].[TABLENAME1] s
Only pull in the columns you need. Your query current filters out any duplicates that already exist in the destination. Doing that can be challenging. Instead, we'll bring the entirety of TABLENAME1 into the pipeline and filter out what we don't need. For very large volumes in your source table, this may be an untenable approach and we'd need to do something different.
From the Source we need to use a Lookup Transformation. This will allow us to detect the duplicates. Use the second connection manager we defined, one that points to the destination. Change the NoMatch from "Fail Component" to "Redirect Unmatched rows" (name approximate)
Use your query to pull back the key value(s)
SELECT T2.id
FROM [dbo].[TABLENAME2] AS T2;
Map T2.id to the id column.
When the package starts, it will issue the above query against the target table and cache all the values of T2.id into memory. Since this is only a single column, that shouldn't be too expensive but again, for very large tables, this approach may not work.
There are 3 outputs now available from the Lookup: Match, NoMatch and Error. Match will be anything that exists in the source and destination. You don't care about those as you are only interested in what exists in source and not destination. When you might care is if you have to determine whether there is change between the values in source and the destination. NoMatch are the rows that exist in Source but don't exist in Destination. That's the stream you want. For completeness, Error would capture things that went very wrong but I've not experience it "in the wild" with a lookup.
Connect the NoMatch stream to the OLE DB Destination. Select your Table Name there and ensure the words Fast Load are in the destination. Click on the Columns tab and make sure everything is routed up.
Whether you need to fiddle with the knobs on the OLE DB Destination is highly variable. I would test it, especially with your larger sets of data and see whether the timeout conditions are a factor.
Design considerations for larger sets
It depends.
Really, it does. But, I would look at identifying where the pain point lies.
If my source table is very large and pulling all that data into the pipeline just to filter it back out, then I'd look at something like a Data Flow to first bring all the rows in my Lookup over to the Source database (use the T2 query) and write it into a staging table and make the one column your clustered key. Then modify your source query to reference your staging table.
Depending on how active the destination table is (whether any other process could load it), I might keep that lookup in the data flow to ensure I don't load duplicates. If this process is the only one that loads it, then drop the Lookup.
If the lookup is at fault - it can't pull in all the IDs then either go with the first alternate listed above or look at changing your caching mode from Full to Partial. Do realize that this will issue a query to the target system for potentially all the rows that come out of the source database.
If the destination is giving issues - I'd determine what the issue is. If it's network latency for the loading of data, drop the value of MaximumCommitInsertSize from 2147483647 to something reasonable, like your batch size from above (although 1k might be a bit low). If you're still encountering blocking, then perhaps staging the data to a different table on the remote server and then doing an insert locally might be an approach.

How can I copy and overwrite data of tables from database1 to database2 in SQL Server

I have a database1 which has more than 500 tables and I have database2 which also has the same number of tables and in both the databases the name of tables are same.. some of the tables have different table definitions, for example a table reports in database1 has 9 columns and the table reports in database2 has 10.
I want to copy all the data from database1 to database2 and it should overwrite the same data and append the columns if structure does not match. I have tried the import export wizard in SQL Server 2008 but it gives an error when it comes to the last step of copying rows. I don't have the screen shot of that error right now, it is my office PC. It says that error inserting into the readonly column xyz, some times it says that vs_isbroken, for the read only column error as I mentioned a enabled the identity insert but it did not help..
Please help me. It is an opportunity in my office for me.
SSIS and SQL Server 2008 Wizards can be finicky tools.
If you get a "can't insert into column ABC", then it could be one of the following:
Inserting into a PK column -> when setting up the mappings, you need to indicate to overwrite the value
Inserting into a column with a smaller range -> for example from nvarchar(256) into nvarchar(50)
Inserting into a calculated column (pointed out by #Nick.McDermaid)
You could also get issues with referential integrity if your database uses this (most do).
If you're going to do this more often, then I suggest you build an SSIS package instead of using the wizard tooling. This way you will see warnings on all sorts of issues like the ones I've described above. You can then run your package on demand.
Another suggestion I would make, is that you insert DB1 into "stage" tables in DB2. These tables should have no relational integrity and will allow you to break the process into several steps as follows.
Stage the data from DB1 into DB2
Produce reports/queries on issues pertinent to your database/rules
Merge the data from stage tables into target tables using SQL
That last step is where you can use merge statements, or simple insert/updates depending on a key match. Using SQL here in the local database is then able to use set theory to manage the overlap of the two sets and figure out what is new or to be updated.
SSIS "can" do this, but you will not be able to do a bulk update using SSIS, whereas with SQL you can. SSIS would do what is known as RBAR (row by agonizing row), something slow and to be avoided.
I suggest you inform your seniors that this will take a little longer to ensure it is reliable and the results reportable. Then work step by step, reporting on each stages completion.
Another two small suggestions:
Create _Archive tables of each of the stage tables and add a Tstamp column to each. Merge into these after the stage step which will allow you to quickly see when which rows were introduced into DB2
After stage and before the SQL merge step, create indexes on your stage tables. This will improve the merge performance
Drop those Indexes after each merge, this will increase the bulk insert Performance
Basic on Staging (response to question clarification):
Links:
http://www.codeproject.com/Articles/173918/How-to-Create-your-First-SQL-Server-Integration-Se
http://www.jasonstrate.com/tag/31daysssis/
http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/05/designing-an-etl-process-with-ssis-two-approaches-to-extracting-and-transforming-data.aspx
Staging is the act of moving data from one place to another without any checks.
First you need to create the target tables, the schema should match the source tables.
Open up BIDS and create a new Project and in it a new SSIS package.
In the package, create a connection for the source server and another for the destination.
Then create a data flow step, in the step create a data source for each table you want to copy from.
Connect each source to a new data destination and set the appropriate connection and table.
When done, save and do a test run.
Before the data flow step, you might like to add a SQL step that will truncate all the target tables.
If you're open to using tools then what about using something like Red Gate Sql Compare and Red Gate SQL Data Compare?
First I would use data compare to manage the schema differences, add the new columns you want to your destination database (database2) from the source (database1). Then with data compare you match the contents of the tables any columns it can't match based on names you specify how to handle. Then you can pick and choose what data you want to copy from your destination. So you'll see what data is new and what's different (you can delete data in the destination that's not in the source or ignore it). You can either have the tool do the work or create you a script to run when you want.
There's a 15 day trial if you want to experiment.
Seems like maybe you are looking for Replication technology as is offered by SQL Server Replication.
Well, if i understood your requirement correctly, you need to make database2 a replica of database1. Why not take a full backup of database1 and restore it as database2? Your database2 will be exactly what database1 is at the time of backup.

using ssis to move tabels and parts of tables from one database to another

I am currently using a database that is poorly designed and a slow pipeline so i decided to copy a small portion of the database (15 tables) and only bring over some of those tables for example i want to bring only the rows that have a certain id.
But this is not a one time move i need to add all the stuff that is added to the old database added to the new one on an hourly basis. My research has led me to SSIS and that it may have a way of accomplishing this but i have found no clear examples on how it is done if in fact it is possible. Thanks in advance.
Yes it is possible . You can schedule your ssis package through sql agent to run on hourly basis .
For a table ,you can drag a Data Flow Task on to the control flow .Inside DFT ,you need to place an oledb Source component ,Lookup ,Data conversion (if the types are different in source and target table) and Oledb destination .
oledb Source component : Create a variable of type string and in the expression write your sql query to fetch the data based on ID.Now use this variable in source component.
Lookup: You need to select your source table and combine the primary key column from the source and destination table.It acts similar to inner join query .After combining the primary key from the both the tables ,select the columns which you need from the source .
Oledb destination : Simply select your target table and map the columns from Lookup no matched output .If you need to update the values from the source then use Lookup matched output and connect it to an execute SQL task and write the update query .
Please go thru the link and SO
Scheduling of SSIS package

Bulk Insert from table to table

I am implementing an A/B/View scenario, meaning that the View points to table A, while table B is updated, then a switch occurs and the view points to table B while table A is loaded.
The switch occurs daily. There are millions of rows to update and thousands of users looking at the view. I am on SQL Server 2012.
My questions are:
how do I insert data into a table from another table in the fastest possible way? (within a stored proc)
Is there any way to use BULK INSERT? Or, is using regular insert/select the fastest way to go?
You could to a Select ColA, ColB into DestTable_New From SrcTable. Once DestTable_New is loaded, recreate indexes and constraints.
Then rename DestTable to DestTable_Old and rename DestTable_New to DestTable. Renaming is extremly quick. If something turns out to have gone wrong, you also have a backup of the previous table close by (DestTable_Old).
I did this scenario once where we had to have the system running 24/7 and needed to load tens of millions of rows each day.
I'd be inclined to use SSIS.
Make table A an OLEDB source and table B an OLEDB destination. You will bypass the transaction log so reduce the load on the DB. The only way (I can think of) to do this using T-SQL is to change the recovery model for your entire database, which is far from ideal because it means no transactions are stored, not just the ones for your transfer.
Setting up SSIS Transfer
Create a new project and drag a dataflow task to your design surface
Double click on your dataflow task which will take you through to the Data Flow tab. Then drag and drop an OLE DB source from the "Data flow Sources" menu, and an OLE DB destination from the "Data flow Destinations" menu
Double click on the OLE DB source, set up the connection to your server, choose the table you want to load from and click OK. Drag the green arrow from the OLE DB source to the destination then double click on the destination. Set up your connection manager, destination table name and column mappings and you should be good to go.
OLE DB Source docs on MSDN
OLE DB Destination docs on MSDN
You could do the
SELECT fieldnames
INTO DestinationTable
FROM SourceTable
as a couple answers suggest, that should be as fast as it can get (depending on how many indexes you'd need to recreate, etc).
But I would suggest using synonyms in order to change the pointer from one table to another. They're very transparent and in my opinion, cleaner than updating the view, or renaming tables.
I know the question is old, but I was hunting for an answer to the same question and didn't find anything really helpful. Yes the SSIS approach is a possibility, but the question wanted a stored proc.
To my delight I have discovered (almost) the solution that the original question wanted; you can do it with a CLR SP.
Select the data from TableA into a DataTable and then use the WriteToServer(DataTable dt) method of the SqlBulkCopy class with TableB as the DestinationTableName.
The only slight drawback is that the CLR procedure must use external access in order to use SqlBulkCopy, and does not work with context connection, so you need to fiddle a little bit with permissions and connection strings. But hey! nothing is ever perfect.
INSERT... SELECT... functions fairly similarly to BULK INSERT. You could use SSIS, like #GarethD says, but that might be overly complex if you're just copying rows from table1 to table2.
If you are copying serious quantities of data, keep an eye on the transaction log -- it can bloat up pretty fast when doing huge inserts. One work-around is to "chunkify" the data you are inserting, by looping over an insert statment that processes, say, only 100,000 or 10,000 rows a time (depends on how wide your rows are, i.e. how many MB per pass).
(Just curious, are you doing ALTER VIEW to reset the view? I did something similar once, though we had to have four tables and four views to support past/present/next/swap sets.)
You can simply do like this
select * into A from B Where [criteria]
This shall select the data from B, based on the criteria and shall insert it into A, provided the columns are same or you can specify column names instead of *.