SSIS - Import multiple sql views as tables periodically - sql

I'm fairly new to SQL Server management and am currently looking in to building a solution with SSIS.
My question will be mostly about "is my logic correct" and some smaller things about what the best practice would be.
But let me paint you a picture to start!
I have an application which provides me with DB Views, to offload the stress on that database I would like to transfer the data of these views periodically to a secondary database/different instance on which I can then also set more specific permissions/transformations/other views build on that data. My views initially provided are pretty much fixed in how I can get them.
After some reading it looked to me that the way to go was to use SSIS. I started building my package and used the "SSIS Import and Export wizard" to do an initial transfer.
Now for my first question, would this be the proper way to transfer the data and is SSIS the right tool for the job?
Secondly, I noticed that the wizard made multiple SQL Preperation tasks and Data flow tasks.
For me it would seem logical to split each view that becomes a table in to a seperate SQL Preperation Task and a seperate Data flow task just to keep a clear picture and as much control as possible. While that would take some time to set up (>100 views/tables) it seems cleaner than how the wizard did it by just grouping some of them together.
Also, since the preperation tasks already create tables these fail when executed a 2nd time as they already exist. Is there a quick workarround for this besides adding a IFEXISTS clause to each query?
Any thoughts on this would be appreciated or hints towards a better solution if I'm approaching this completly from the wrong direction.
The idea is to later on add some SSAS to the system and provide some data analytics on these tables/data as well.
Thanks!

If you are transferring the data to a different instance then SSIS is probably your best bet. Your next question is to work out whether you want to import all the data each time or just the new/updated items.
If you are exporting all the data every time, this is much simpler and assumig you have a suitable maintenance window (such as overnight) that you can complete the process in without affecting your end users, you can get away with simply truncating the data and re-loading. This obviously has consequences related to increased data transfer volumes.
If you want to only export the new/updated data, you will now need to work out whether or not you can actually work out which rows are new or different without simply comparing them all to what you have in your secondary database. Ideally your source tables will have a reliable LastUpdateDate column or better yet a rowversion column, using which you can export all rows with a more recent value than can be seen in the corresponding table in your secondary instance.
There is a lot of reading to be done regarding the updates only route, for which I would strongly suggest you avoid the Slowly Changing Dimension transformation like the plague.
You are also right in thinking that there is a lot of repetitive tasks when you want to do simple operations across a large number of similar objects, such as adding that ifexists to the table creation in your post. The best way to tackle this I have found is to learn how to use Biml to automate the repetitive tasks based on metadata.
Good Luck!

Related

Sql Database structure for housing historical data and display changes

Good morning,
This is more of a concept question then anything.
I am looking to design a database and interface that will track changes to the entries (in this case people) and display those changes readily.
(user experience would look something like this)
for user A
Date Category Activity
8/8/14 change position position 1 -> position 2
8/9/14 change department department a -> department b
...
...
the visual experience seem like it would benefit from an E-A-V design, however i am designing the database to be easy to data mine and from my reading, i think that E-A-V is not the right way to go.
does it make sense to duplicate data just to display it?
if not, does anyone have a suggestion of how to query the history table and display? (currently using jquery and php to leverage the db...i suppose i could do something interesting from a coding perspective to get it done)
thank you for your help,
Travis
Creating an efficient operational database environment and a creating an 'easy-to-data mine' environment are two separate (and often opposing) goals.
Others might disagree with me but in my opinion it is best to create your database based on operational readiness (This means using your E-A-V design as mentioned above) and then worry about data transformation later. This may make it inconvenient later to transform the data to allow for easy mining but it will accomplish an incredibly important goal which is to eliminate the possibility for data error.
Once you have a good system in place where you can collect data appropriately, then you can create a warehouse or datamart environment to more conveniently extract that data.
This may sound like a lot of work but from a data integrity perspective, it is much safer than trying to create some system that is designed entirely for reporting. That's my personal opinion at least.
(sorry cannot comment yet)
You have to analyse the data you need to persist.
if you have only a couple of tables, with no relationship, you probably don't need the database.
In this case the database solution probably will be slower(connection/transmission/security overhead ...).
well if it's a few MBs of data, I would keep everything in one table.
You can easily load the whole data set in memory and do what you need to do.

Change detection in complex system

This might seem like a fairly specific question but I'm wondering if there is any technology/pattern out there that might help me in a current project. I have a hugely complex database which is updated by multiple systems. I now need to do change tracking on various bits of data that is spread across multiple tables so that I can send it to a third party system.
I've considered a number of options but unfortunately I can't seem to come to any other conclusion than using database triggers. I'm thinking of storing a flag in a table (or queue) to identify the rows that have changed and then building an xml diff containing the changed data to send to a web service. This feels a little dirty so I was wondering if anyone could think of a better alternative.
Depending on the database platform you're using, you might look into Change Data Capture. Since you mention .NET, here's some info about it: http://technet.microsoft.com/en-us/library/bb522489(v=sql.105).aspx
Other database systems may offer something similar.
Another option would be insert/update/delete triggers on the tables, however triggers should be approached carefully as they can cause some significant performance problems if not done right.
And yet another option still would be what you describe - some sort of flag to monitor for changes. A simple CREATED and MODIFIED timestamp fields can go a long way here, as rather than just a bit indicator suggesting that the row may need attention, you'll know when the update happened, and your export process can be programmed accordingly (e.g., select * from table where modified > getdate()-1).

Moving data from production db to datawarehouse (SQL Server)

We are developing a reporting module for our software, and because of this we need to move some data from the system's production db into a datawarehouse db which will be used as the datasource for the reports (SQL Server reporting).
The schema in the production DB is quite old, so once we have data in the DW DB, we will need some additional fields (for example, calculating a correct datetime colum out of the prod db's 'date' and 'time' integer columns. (Don't ask, it's old.)
We are discussing internally how to do this in an efficient manner. Right now, it is implemented in a fugly SSIS job that basically tears down the entire DW DB every night and builds it up again from the prod db, doing data transformations as it goes. This doesn't scale very well.
I've been looking into using "newer" technologies, like for example SQL Server replication to move data in a more granular fashion.
My questions about this is:
-With replication the "move data" part is obviously solved, but not the data transform part. I know I can create update triggers on the DW DB, but all table-related triggers seem to be wiped whenever I do a reinitialize on the subscription, which makes it hard to set up.
I'm not looking for an exact answer here, more a hint on which direction to take this. Sorry if the question is a bit blurry.
update:
thanks for the good points below. This is software we're selling to customers, so I'm a big fan of having as few as possible "config items" for the customer to set up and maintain. The SSIS package as it stands today is one more "item" for the customer to keep tabs on, along with its schedules.
Replication intriguied me because it completely abscracts the whole CRUD "dilemma" when moving data, but you may be right - SSIS would still be better, as long as the SSIS logic is created a bit smarter than today.
Data might be quite large tho, so wiping and reimporting everything like we do today is definetely a problem that needs adressing
.
I don't think replication is a good idea. It would be if the source and destination schemas were exactly the same, but as you pointed out, they are not. And also, all the calculations you mention the SSIS is doing, you still would have to do it because replication wouldn't.
I think SSIS is the way to go, I mean, this is exactly why it exists.
Since you are recreating the DB on each load and if the amount of calculations and changes are not big and you don't need to do things lookups to get surrogate keys from natural keys, you could create views on your main database to try to mimic the structure of the destination database so you can do direct inserts (pretty much a source component mapped to a destiantion component)
maybe if you specify what's the real issue with SSIS you want to solve, it would be easier to help.
Just a quick update on this: The CDC functionality of SQL Server seems to be what we need to look into, this functionality integrates nicely with SSIS. Thanks for the hint on Slowly Changing Dimensions, and SSIS!

Database System Architecture discussion

I'd like to start a discussion about the implementation of a database system.
I'm working for a company having a database system grown over ca. the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do do data calculations and prepare the data for the reporting team. This is done via multiple levels. Let me try to explain the process a little bit more in detail: The Dataflow Team uses the data from the Entry database copied to another server and another database via Transactional-Replication (this data contains information from all clients). Then once per hour a self-written application is checking for changed rows in the input tables (using a ChangedDate column) and then calling a stored procedure for each output table calculating new data using 1-N of the input tables. After that the data is copied to another database on another server using again Transaction-Replication. Here another stored procedure is called to calclulate additional new output tables. This stored procedure is started using a SQL job. From there the data is split to different databases, each database being client specific. This copying is done using another self-written application using the .NET bulkcopy command (filtering on the client). These client specific databases are copied to different client specific reporting databases on other servers via another self-written application which compares the reporting database with the client specific database to calculate the data difference. Just the data differences are copied (because the reporting database run in former times on the client servers).
This whole process is orchestrated by another self-written application to control e.g. if the Transactional-Replications are finished before starting the job to call the Stored procedure etc... Futhermore also the synchronisation between the different clients is orchestrated here. The process can be graphically displayed by a self-written monitoring tool which looks pretty complex as you can imagine...
The status of all this components is logged and can be viewed by another self-written application.
If new columns or tables are added all this components have to be manually changed.
For deployment installation instructions are written using MS Word. (ca. 10 people working in this team)
Reporting:
The Reporting Team created it's own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the Web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people realize this scenario, I can't imagine that every company writes it's own custom applications.
What are actually the possibilities to allow fast calculations on databases (next to using T-SQL). I'm somehow missing the link here to the object oriented programming I'm used to from my old company, but we never dealt with so much data and maybe for fast calculations this is the way to do it...Or is it possible using e.g. LINQ or BizTalk Server to create the algorithms and calculations, maybe even in a graphical way? The question is just how to convert the existing meter-long Stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like t-sql? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is and the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to start by learning first how SQL works, and doing it in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it.)
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQl SPs.
We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand, otherwise, they would be directly working on SQL Server, and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking, you are building a customized system to report on the data at hand. The only item that might be worth considering on these side, is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it. WIthout knowing more of your organization and the exact nature of operations.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard pressed to defeat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure I can see why you'd want to go to .NET. But you'd probably increase performance by figuring out how to rewrite them all nice and SET based. BCP is not going to be able to be beaten. When I used SQL Server 2000 BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance no doubt people are going to be coming to you. Still if you are doing a ton of row by row complex calculations, optimizing that into a CLR stored procedure or even a .NET application that is called from SQL Server to do the processing will probably result in a speed up. Of course if you were row processing and you manage to rewrite the queries to do set processing you'd probably get a bigger speed up. But depending upon how complex the calculations are .NET may help.
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might as is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and or the column names. If so, you can probably use .NET combined with SQL-SMO(or DMO if using SQL Server 2000) to code generate it.
Here's an example that I often see to load a datawarehouse
Assuming some row tables are loaded with the data from the source
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....

Effectively transforming data from one SQL database to another in live environment

We have a bit of a messy database situation.
Our main back-office system is written in Visual Fox Pro with local data (yes, I know!)
In order to effectively work with the data in our websites, we have chosen to regularly export data to a SQL database. However the process that does this basically clears out the tables each time and does a re-insert.
This means we have two SQL databases - one that our FoxPro export process writes to, and another that our websites read from.
This question is concerned with the transform from one SQL database to the other (SqlFoxProData -> SqlWebData).
For a particular table (one of our main application tables), because various data transformations take places during this process, it's not a straightforward UPDATE, INSERT and DELETE statements using self-joins, but we're having to use cursors instead (I know!)
This has been working fine for many months but now we are starting to hit upon performance problems when an update is taking place (this can happen regularly during the day)
Basically when we are updating SqlWebData.ImportantTable from SqlFoxProData.ImportantTable, it's causing occasional connection timeouts/deadlocks/other problems on the live websites.
I've worked hard at optimising queries, caching etc etc, but it's come to a point where I'm looking for another strategy to update the data.
One idea that has come to mind is to have two copies of ImportantTable (A and B), some concept of which table is currently 'active', updating the non-active table, then switching the currenly actice table
i.e. websites read from ImportantTableA whilst we're updating ImportantTableB, then we switch websites to read from ImportantTableB.
Question is, is this feasible and a good idea? I have done something like it before but I'm not convinced it's necessarily good for optimisation/indexing etc.
Any suggestions welcome, I know this is a messy situation... and the long term goal would be to get our FoxPro application pointing to SQL.
(We're using SQL 2005 if it helps)
I should add that data consistency isn't particularly important in the instance, seeing as the data is always slightly out of date
There are a lot of ways to skin this cat.
I would attack the locking issues first. It is extremely rare that I would use CURSORS, and I think improving the performance and locking behavior there might resolve a lot of your issues.
I expect that I would solve it by using two separate staging tables. One for the FoxPro export in SQL and one transformed into the final format in SQL side-by-side. Then either swapping the final for production using sp_rename, or simply using 3 INSERT/UPDATE/DELETE transactions to apply all changes from the final table to production. Either way, there is going to be some locking there, but how big are we talking about?
You should be able to maintain one db for the website and just replicate to that table from the other sql db table.
This is assuming that you do not update any data from the website itself.
"For a particular table (one of our main application tables), because various data transformations take places during this process, it's not a straightforward UPDATE, INSERT and DELETE statements using self-joins, but we're having to use cursors instead (I know!)"
I cannot think of a case where I would ever need to perform an insert, update or delete using a cursor. If you can write the select for the cursor, you can convert it into an insert, update or delete. You can join to other tables in these statements and use the case stament for conditional processing. Taking the time to do this in a set -based fashion may solve your problem.
One thing you may consider if you have lots of data to move. We occassionally create a view to the data we want and then have two tables - one active and one that data will be loaded into. When the data is finsihed loading, as part of your process run a simple command to switch the table the view uses to the one you just finshed loading to. That way the users are only down for a couple of seconds at most. You won't create locking issues where they are trying to access data as you are loading.
You might also look at using SSIS to move the data.
Do you have the option of making the updates more atomic, rather than the stated 'clear out and re-insert'? I think Visual Fox Pro supports triggers, right? For your key tables, can you add a trigger to the update/insert/delete to capture the ID of records that change, then move (or delete) just those records?
Or how about writing all changes to an offline database, and letting SQL Server replication take care of the sync?
[sorry, this would have been a comment, if I had enough reputation!]
Based on your response to Ernie above, you asked how you replicate databases. Here is Microsoft's how-to about replication in SQL2005.
However, if you're asking about replication and how to do it, it indicates to me that you are a little light in experience for SQL server. That being said, it's fairly easy to muck things up and while I'm all for learning by experience, if this is mission critical data, you might be better off hiring a DBA or at the very least, testing the #$##$% out of this before you actually implement it.