I develop some application that has database wery generic so really can't use it for reporting. So I need solution how to create reporting. I'm developer so my knowledge in DBA domain is bounded. For now I have ideo to create another database where I'll pu denormalized data from original db. So I saw that I could use SSIS for that and woul be glad if someone could give me some advice how to attack that problem. Should I sync data once a day and run reports that way. Is there solution to sync data allways so reports would be up to date? Please any advice.. Thanks!
Damir,
What I get from your message is that you are getting close to build a Datawarehouse using a Star Schema pattern.
You could have two databases, One with normalized data and the other one with the Star Schema pattern (Your DW), and then create a script that would use your normalized data and put them in your datawarehouse. For the frequency of your script it is up to you : After each transaction, every hour, once a day, etc...
The advantage of having a datawarehouse is that you will be able to use OLAP cubes and the MDX language for your reports. It's a plus !
Hope it could help,
If you are on sql server 2005 or greater, explore Merge statement.
For smaller tables, just truncate and reload. 'Smaller' could be subjective - but if takes less than 2-3 minutes to load, that could be termed as small. Obviously, during that period any query that uses such tables would fail.
Related
I have a query which is pulling data from almost 125 different tables, I have created some 13 nested Stored Procedures calling other stored procedures to pull all the required data. Surprise Surprise Surprise The query takes ages to execute and sometimes I have to kill he connection and rerun it.
I have been advised to make use of staging table, Move required data there using SSIS packages and pull data from there, but I am a bit reluctant to use SSIS as Im not very comfortable with SSIS and This report is requested once in a while and also moving around 10-15 gb data for one report seems a lot of hassle.
Any suggestion any ideas please to make this hell of task a bit simpler, quicker and less error prone ???
Create a reporting database. On some frequency, be that hourly, daily, or whatever frequency meets the needs of the reports users, ETL the data from the transactional database into the reporting database.
You can use SSIS or you could choose to execute some stored procedures for ETL. Regardless, you probably will schedule it with a SQL Agent Job.
Finally, in terms of designing your report database, consider transforming the data in a way that will help the reports performance. Many people "flatten" or de-normalize data for the purpose of reporting. We ETL transactional data into a data warehouse that uses the "star schema" pattern and we also have an Analysis Services Database and MDX Reports as well. Most likely you don't need to go that far for one report, but, that is further down this same path of optimized data structures for reporting and BI.
I am in the process of building a Data Warehouse(DW) and I have a question about loading data; I would appreciate if you guys provide your thoughts on this.
I am planning to load all the tables one-to-one in a staging database first and then load the data into the DW from the staging database. I thought about hitting the OLTP system directly(no staging) but am not 100% sure this would be the best approach from a performance perspective.
Let me give you an example: In our OLTP database, we have a view called Customers that I’ll be pulling into our DW. The view on OLTP database is pretty complex and a select statement takes 8 minutes. So If I load this table directly into the DW and do an incremental load, am thinking this would take more time than loading the view into a staging table first. Also, since the load is going to take time, the DW availability would also be affected as the data won’t be available to users for querying.
What do you guys suggest? Is the staging approach outdated now? I want to understand what the pros and cons are. Thanks in advance for your help
I help maintain a data warehouse and while we don't use a staging database, we do use staging/working/intermediate/whatever_you_want_to_call_it tables.
The gist of what we do is this. We receive the raw data as a series of delimited files. We then do whatever we deem necessary to these files to produce load files. We then populate our working tables from the load files and do whatever we have to do to further prepare the data. Then we populate the real tables from the working tables.
We also do everything as a scheduled job, early in the morning before people come to work, to minimize the liklihood of people trying to query the warehouse while data is being loaded.
Is it correct in my understanding that we can build SSAS cubes sourcing from the transaction Systems? I meant the not the live but copy of the Live.
I'm trying to see if there is any scope to address few reporting needs without the need to build a traditional Data Warehouse and then build cubes on top of the data warehouse, instead build cubes to do Financial monthly aggregated reporting needs sourcing from backup copy of the Transaction systems.
Alternatively, if you have any better way to proceed please suggest.
Regards,
KK
You can create a set of views on top of you transactional system tables and then build your SSAS cubes ontop of those views. This would be less effort than creating a fully fledged datawarehouse.
I am a data warehouse developer (and therefore believe in cubes), but not every reporting solution warrants the cost of building a cube. If your short to medium term reporting requirements are fixed and you don't have users requiring data to be sliced differently each week, then a series of fixed reports may suffice.
You can create a series of SQL Server Reporting Services reports (or extract to Excel) either directly against your copied transactional data, or against a series of summarised tables that are created periodically. If you decide to utilise a series of pre-formatted reporting tables, try to create tables that cover multiple similar reports (rather than 1 monthly report table = 1 report) for ease of ongoing maintenance.
There are many other important aspects to this that you may need to consider first. Like how busy is the transaction system, what is the size of the data, concurrency and availability issues etc.
It is absolutely fine to have a copy of your live data and then build a report on the top of it. Bear in mind that the data you see in the report will not be the latest and there will be a latency factor depending on the frequency of your data pull.
Firstly, let me apologize for the title, as it probably isn't as clear as I think it is.
What I'm looking for is a way to keep sample data in a database (SQL, 2005 2008 and Express) that get modified every so often. At present I have a handful of scripts to populate the database with a specific set of data, but every time the database is changed all the scripts have to be more or less rewritten and I was looking for some alternatives.
I've seen a number of tools and other software for creating sample data in a database, some free and some not. Are there any other methods I haven’t considered?
Thanks in advance for any input.
Edit: Also, if anyone has any advice at all in dealing with keeping data in sync with a changing application or database, that would be of some help as well.
If you are looking for tools for SQL server, go visit Red Gate Software, they have the best tools. They have a data compare tool that you can use to keep lookup type tables up-to-date and a SQL compare tool that you can use to keep the tables synched up between two datbases. So using SQL data compare, create a datbase with all the sample data you want. Then periodically refresh your testing db (or your prod db if these are strictly lookup type tables) using the compare tool.
I also like the alternative of having a script (you can use Red Gate's tool to create scripts) because that means you can store this info in your source control and use it as part of a deployment package to other servers.
You could save them in another database or the same db in different tables distinguished by the name, like employee_test
Joseph,
Do you need to keep just the data in sync, or the schema as well?
One solution to the data question would be SQL Server snapshots. You create a snapshot of your initial configuration, so any changes to the "real" database don't show up in the snapshot. Then, when you need to reset the table, select from the snapshot into a new table. I'm not sure how it will work if the schema changes, but it might be worth a try.
For generation of sample data, the Database project in Visual Studio has functionality that will create fake/random data.
Let me know if this make sense.
Erick
I wan't sure how to word this question so I'll try and explain. I have a third-party database on SQL Server 2005. I have another SQL Server 2008, which I want to "publish" some of the data in the third-party database too. This database I shall then use as the back-end for a portal and reporting services - it shall be the data warehouse.
On the destination server I want store the data in different table structures to that in the third-party db. Some tables I want to denormalize and there are lots of columns that aren't necessary. I'll also need to add additional fields to some of the tables which I'll need to update based on data stored in the same rows. For example, there are varchar fields that contain info I'll want to populate other columns with. All of this should cleanse the data and make it easier to report on.
I can write the query(s) to get all the info I want in a particular destination table. However, I want to be able to keep it up-to-date with the source on the other server. It doesn't have to be updated immediately (although that would be good) but I'd like for it be updated perhaps every 10 minutes. There are 100's of thousands of rows of data but the changes to the data and addition of new rows etc. isn't huge.
I've had a look around but I'm still not sure the best way to achieve this. As far as I can tell replication won't do what I need. I could manually write the t-sql to do the updates perhaps using the Merge statement and then schedule it as a job with sql server agent. I've also been having a look at SSIS and that looks to be geared at the ETL kind of thing.
I'm just not sure what to use to achieve this and I was hoping to get some advice on how one should go about doing this kind-of thing? Any suggestions would be greatly appreciated.
For that tables whose schemas/realtions are not changing, I would still strongly recommend Replication.
For the tables whose data and/or relations are changing significantly, then I would recommend that you develop a Service Broker implementation to handle that. The hi-level approach with service broker (SB) is:
Table-->Trigger-->SB.Service >====> SB.Queue-->StoredProc(activated)-->Table(s)
I would not recommend SSIS for this, unless you wanted to go to something like dialy exports/imports. It's fine for that kind of thing, but IMHO far too kludgey and cumbersome for either continuous or short-period incremental data distribution.
Nick, I have gone the SSIS route myself. I have jobs that run every 15 minutes that are based in SSIS and do the exact thing you are trying to do. We have a huge relational database and then we wanted to do complicated reporting on top of it using a product called Tableau. We quickly discovered that our relational model wasn't really so hot for that so I built a cube over it with SSAS and that cube is updated and processed every 15 minutes.
Yes SSIS does give the aura of being mainly for straight ETL jobs but I have found that it can be used for simple quick jobs like this as well.
I think, staging and partitioning will be too much for your case. I am implementing the same thing in SSIS now but with a frequency of 1 hour as I need to give some time for support activities. I am sure that using SSIS is a good way of doing it.
During the design, I had thought of another way to achieve custom replication, by customizing the Change Data Capture (CDC) process. This way you can get near real time replication, but is a tricky thing.