I am in the process of building a Data Warehouse (DW) and I have a question about loading data; I would appreciate it if you guys could share your thoughts on this.
I am planning to load all the tables one-to-one into a staging database first and then load the data into the DW from the staging database. I thought about hitting the OLTP system directly (no staging), but I'm not 100% sure that would be the best approach from a performance perspective.
Let me give you an example: in our OLTP database, we have a view called Customers that I'll be pulling into our DW. The view on the OLTP database is pretty complex, and a select against it takes 8 minutes. So if I load from this view directly into the DW and do an incremental load, I'm thinking it would take more time than loading the view into a staging table first. Also, since the load is going to take time, DW availability would be affected, as the data wouldn't be available to users for querying during the load.
What do you guys suggest? Is the staging approach outdated now? I want to understand what the pros and cons are. Thanks in advance for your help.
I help maintain a data warehouse and while we don't use a staging database, we do use staging/working/intermediate/whatever_you_want_to_call_it tables.
The gist of what we do is this. We receive the raw data as a series of delimited files. We then do whatever we deem necessary to these files to produce load files. We then populate our working tables from the load files and do whatever we have to do to further prepare the data. Then we populate the real tables from the working tables.
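For what it's worth, here is a minimal T-SQL sketch of that flow, with hypothetical file paths, schema names, and table names (your load files, working tables, and real tables will obviously differ):

```sql
-- Hypothetical names and paths; delimited load file -> working table -> real table.
BULK INSERT work.Customer_Load
FROM 'D:\loads\customer_20240101.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- "Whatever we have to do to further prepare the data" happens here, e.g. trimming:
UPDATE work.Customer_Load
SET    Email = LTRIM(RTRIM(Email));

-- Then populate the real table from the working table.
INSERT INTO dw.DimCustomer (CustomerID, CustomerName, Email)
SELECT CustomerID, CustomerName, Email
FROM   work.Customer_Load;
```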
We also do everything as a scheduled job, early in the morning before people come to work, to minimize the likelihood of people trying to query the warehouse while data is being loaded.
What are the pros and cons of Hive external and managed tables?
We want to do updates and inserts in Hive tables but wonder which approach to take: managed tables, or a workaround that refreshes external tables after manual file updates, especially after adding many files over time. Will one approach or the other become too slow (e.g., too many files or too many updates to track via the metastore, so that the master node becomes slow)?
Thanks.
There are a number of limitations on DML in Hive; please read the documentation link for more details: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It's generally recommended not to use DML on Hive managed tables, especially if the data volume is huge or the table grows in size over time, because these operations become too slow. They are considerably faster when done on a partition or bucket instead of the full table. Nevertheless, it is better to handle the edits in the files and do a full refresh via an external table, and only use DML on managed tables as a last resort.
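To make that concrete, here is a minimal Hive sketch with assumed table names and HDFS paths, showing the "edit the files and refresh via an external table" route, with partition-level DML on a managed table left as the last resort:

```sql
-- Hive dialect; assumed names/paths. External table over files that are edited/replaced outside Hive.
CREATE EXTERNAL TABLE sales_ext (
    id     BIGINT,
    amount DECIMAL(10,2)
)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales/';

-- After the files for a day are rebuilt, the "refresh" is just re-registering the partition:
ALTER TABLE sales_ext DROP IF EXISTS PARTITION (ds = '2024-01-15');
ALTER TABLE sales_ext ADD PARTITION (ds = '2024-01-15') LOCATION '/data/sales/ds=2024-01-15/';

-- Last resort: DML on a managed, transactional (ORC) table, restricted to one partition.
-- UPDATE sales_managed SET amount = 0 WHERE ds = '2024-01-15' AND id = 42;
```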
I have a query which pulls data from almost 125 different tables, and I have created some 13 nested stored procedures calling other stored procedures to pull all the required data. Surprise, surprise: the query takes ages to execute, and sometimes I have to kill the connection and rerun it.
I have been advised to make use of a staging table: move the required data there using SSIS packages and pull the data from there. But I am a bit reluctant to use SSIS, as I'm not very comfortable with it, this report is only requested once in a while, and moving around 10-15 GB of data for one report seems like a lot of hassle.
Any suggestions or ideas to make this hellish task a bit simpler, quicker, and less error prone?
Create a reporting database. On some frequency, be that hourly, daily, or whatever frequency meets the needs of the report's users, ETL the data from the transactional database into the reporting database.
You can use SSIS or you could choose to execute some stored procedures for ETL. Regardless, you probably will schedule it with a SQL Agent Job.
Finally, in terms of designing your reporting database, consider transforming the data in a way that will help report performance. Many people "flatten" or denormalize data for reporting. We ETL transactional data into a data warehouse that uses the star schema pattern, and we also have an Analysis Services database and MDX reports. Most likely you don't need to go that far for one report, but that is further down this same path of optimized data structures for reporting and BI.
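As a rough illustration (table, schema, and column names here are all hypothetical), a flattened reporting table plus a refresh procedure that a SQL Agent job could call on whatever schedule the report needs:

```sql
-- Hypothetical tables/columns: a flattened reporting table plus a refresh procedure
-- that a SQL Agent job can run on a schedule.
CREATE TABLE rpt.OrderSummary (
    OrderLineID  INT           NOT NULL PRIMARY KEY,
    OrderDate    DATE          NOT NULL,
    CustomerName NVARCHAR(200) NOT NULL,
    ProductName  NVARCHAR(200) NOT NULL,
    Quantity     INT           NOT NULL,
    LineTotal    DECIMAL(18,2) NOT NULL
);
GO

CREATE PROCEDURE rpt.usp_RefreshOrderSummary
AS
BEGIN
    SET NOCOUNT ON;

    -- Simple full refresh; an incremental MERGE is the next step if volumes make this too slow.
    TRUNCATE TABLE rpt.OrderSummary;

    INSERT INTO rpt.OrderSummary (OrderLineID, OrderDate, CustomerName, ProductName, Quantity, LineTotal)
    SELECT ol.OrderLineID, o.OrderDate, c.CustomerName, p.ProductName, ol.Quantity,
           ol.Quantity * ol.UnitPrice
    FROM dbo.Orders     AS o
    JOIN dbo.Customers  AS c  ON c.CustomerID = o.CustomerID
    JOIN dbo.OrderLines AS ol ON ol.OrderID   = o.OrderID
    JOIN dbo.Products   AS p  ON p.ProductID  = ol.ProductID;
END
```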
I develop an application whose database is very generic, so I really can't use it for reporting. So I need a solution for how to create reporting. I'm a developer, so my knowledge in the DBA domain is limited. For now, my idea is to create another database where I'll put denormalized data from the original DB. I saw that I could use SSIS for that and would be glad if someone could give me some advice on how to attack the problem. Should I sync the data once a day and run reports that way? Is there a solution to sync the data continuously so the reports would always be up to date? Please, any advice. Thanks!
Damir,
What I get from your message is that you are getting close to building a data warehouse using a star schema pattern.
You could have two databases, one with the normalized data and the other with the star schema pattern (your DW), and then create a script that takes your normalized data and puts it in your data warehouse. The frequency of that script is up to you: after each transaction, every hour, once a day, etc.
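For example (database, table, and column names below are hypothetical), the script could be as simple as a couple of INSERT...SELECT statements from the normalized DB into the star schema DB:

```sql
-- A minimal sketch with made-up names: one dimension and one fact table in the
-- star-schema database, loaded by a script from the normalized application database.
INSERT INTO StarDW.dbo.DimCustomer (CustomerBK, CustomerName, City)
SELECT c.CustomerID, c.Name, a.City
FROM AppDB.dbo.Customer AS c
JOIN AppDB.dbo.Address  AS a ON a.AddressID = c.AddressID
WHERE NOT EXISTS (SELECT 1
                  FROM StarDW.dbo.DimCustomer AS d
                  WHERE d.CustomerBK = c.CustomerID);

INSERT INTO StarDW.dbo.FactSales (DateKey, CustomerKey, Amount)
SELECT CONVERT(INT, CONVERT(CHAR(8), s.SaleDate, 112)),  -- yyyymmdd date key
       d.CustomerKey, s.Amount
FROM AppDB.dbo.Sale         AS s
JOIN StarDW.dbo.DimCustomer AS d ON d.CustomerBK = s.CustomerID
WHERE s.SaleDate >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE));  -- e.g. a once-a-day window
```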
The advantage of having a data warehouse is that you will be able to use OLAP cubes and the MDX language for your reports. It's a plus!
Hope it helps,
If you are on SQL Server 2008 or greater, explore the MERGE statement.
For smaller tables, just truncate and reload. 'Smaller' is subjective, but if a table takes less than 2-3 minutes to load, it could be termed small. Obviously, during that period any query that uses such tables would fail.
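A minimal sketch of the MERGE-based incremental load, with hypothetical staging and warehouse table names:

```sql
-- Hypothetical names: incremental upsert from a staging table into the warehouse
-- table using MERGE (available from SQL Server 2008 onward).
MERGE dw.DimCustomer AS tgt
USING staging.Customer AS src
    ON tgt.CustomerID = src.CustomerID
WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName OR tgt.Email <> src.Email) THEN
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.Email        = src.Email
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName, Email)
    VALUES (src.CustomerID, src.CustomerName, src.Email);
```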
I wasn't sure how to word this question, so I'll try to explain. I have a third-party database on SQL Server 2005. I have another SQL Server 2008 instance to which I want to "publish" some of the data in the third-party database. I shall then use this database as the back end for a portal and Reporting Services; it shall be the data warehouse.
On the destination server I want store the data in different table structures to that in the third-party db. Some tables I want to denormalize and there are lots of columns that aren't necessary. I'll also need to add additional fields to some of the tables which I'll need to update based on data stored in the same rows. For example, there are varchar fields that contain info I'll want to populate other columns with. All of this should cleanse the data and make it easier to report on.
I can write the query (or queries) to get all the info I want into a particular destination table. However, I want to be able to keep it up to date with the source on the other server. It doesn't have to be updated immediately (although that would be good), but I'd like it to be updated perhaps every 10 minutes. There are hundreds of thousands of rows of data, but the changes to the data, addition of new rows, etc. aren't huge.
I've had a look around, but I'm still not sure of the best way to achieve this. As far as I can tell, replication won't do what I need. I could manually write the T-SQL to do the updates, perhaps using the MERGE statement, and then schedule it as a job with SQL Server Agent. I've also been having a look at SSIS, and that looks to be geared at the ETL kind of thing.
I'm just not sure what to use to achieve this, and I was hoping to get some advice on how one should go about doing this kind of thing. Any suggestions would be greatly appreciated.
For those tables whose schemas/relations are not changing, I would still strongly recommend replication.
For the tables whose data and/or relations are changing significantly, I would recommend that you develop a Service Broker implementation to handle that. The high-level approach with Service Broker (SB) is:
Table-->Trigger-->SB.Service >====> SB.Queue-->StoredProc(activated)-->Table(s)
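A stripped-down sketch of that pipeline, with hypothetical object names throughout (real code would also filter on message type and handle error/end-dialog messages):

```sql
-- Hypothetical names; minimal Service Broker plumbing for the flow above.
CREATE MESSAGE TYPE CustomerChangeMsg VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT CustomerChangeContract (CustomerChangeMsg SENT BY INITIATOR);

CREATE QUEUE CustomerInitiatorQueue;
CREATE QUEUE CustomerTargetQueue;
CREATE SERVICE CustomerChangeSender ON QUEUE CustomerInitiatorQueue;
CREATE SERVICE CustomerChangeTarget ON QUEUE CustomerTargetQueue (CustomerChangeContract);
GO

-- Trigger: package the changed keys as XML and send them to the service.
CREATE TRIGGER trg_Customer_Change ON dbo.Customer
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @msg XML = (SELECT CustomerID FROM inserted FOR XML PATH('row'), ROOT('changes'));
    DECLARE @handle UNIQUEIDENTIFIER;

    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE CustomerChangeSender
        TO SERVICE 'CustomerChangeTarget'
        ON CONTRACT CustomerChangeContract
        WITH ENCRYPTION = OFF;

    SEND ON CONVERSATION @handle MESSAGE TYPE CustomerChangeMsg (@msg);
END
GO

-- Activated procedure: drain the queue and apply the changes to the destination table(s).
CREATE PROCEDURE dbo.usp_ApplyCustomerChanges
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @msg XML, @handle UNIQUEIDENTIFIER;

    RECEIVE TOP (1) @handle = conversation_handle,
                    @msg    = CAST(message_body AS XML)
    FROM CustomerTargetQueue;

    IF @handle IS NOT NULL
    BEGIN
        IF @msg IS NOT NULL
        BEGIN
            -- Simplified: in practice you would re-query the source for the full rows.
            INSERT INTO rpt.Customer (CustomerID)
            SELECT x.c.value('.', 'INT')
            FROM   @msg.nodes('/changes/row/CustomerID') AS x(c)
            WHERE  x.c.value('.', 'INT') NOT IN (SELECT CustomerID FROM rpt.Customer);
        END
        END CONVERSATION @handle;
    END
END
GO

ALTER QUEUE CustomerTargetQueue
    WITH ACTIVATION (STATUS = ON,
                     PROCEDURE_NAME = dbo.usp_ApplyCustomerChanges,
                     MAX_QUEUE_READERS = 1,
                     EXECUTE AS OWNER);
```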
I would not recommend SSIS for this, unless you wanted to go to something like daily exports/imports. It's fine for that kind of thing, but IMHO it is far too kludgey and cumbersome for either continuous or short-period incremental data distribution.
Nick, I have gone the SSIS route myself. I have jobs that run every 15 minutes that are based in SSIS and do the exact thing you are trying to do. We have a huge relational database and then we wanted to do complicated reporting on top of it using a product called Tableau. We quickly discovered that our relational model wasn't really so hot for that so I built a cube over it with SSAS and that cube is updated and processed every 15 minutes.
Yes SSIS does give the aura of being mainly for straight ETL jobs but I have found that it can be used for simple quick jobs like this as well.
I think staging and partitioning will be too much for your case. I am implementing the same thing in SSIS now, but with a frequency of one hour, as I need to allow some time for support activities. I am sure that using SSIS is a good way of doing it.
During the design, I had thought of another way to achieve custom replication: building on the Change Data Capture (CDC) process. This way you can get near-real-time replication, but it is a tricky thing.
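For reference, a minimal sketch of enabling CDC on one table and reading its changes (the table name, and therefore the capture instance name, is an assumption here):

```sql
-- Assumed source table dbo.Customer; the capture instance defaults to 'dbo_Customer'.
EXEC sys.sp_cdc_enable_db;            -- once per database

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Customer',
     @role_name     = NULL;

-- On each run, pull the changes captured so far
-- (a real job would persist the last LSN it processed instead of using the minimum).
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Customer');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');
```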
I have a database server with a few main databases and a few dozen small ones.
These small databases are kind of intermediary/staging databases for data import from various sources into the main database. Data import is a daily task. They are all quite similar in structure, as the implementations of these data imports are similar: basically they have configuration tables, which define mappings, conversions, etc., and data tables, which contain the results of the import.
Some time ago there were only a handful of the small ones, but now I have more than 20 of them, and the number will grow further with the number of supported data feeds.
I have just migrated the whole server environment to SQL Server 2008, and having some time now for clean-up/refactoring, I am thinking of merging all the data-import databases into just one database and using database schemas to separate them.
Question-0: Any other ideas for the described situation?
Question-1: Shall I change from separate databases to separate schemas?
Question-2 (!!!): Are there any tricky things to be careful about in a database schema implementation?
Edit-1: highlighted Question-2 as currently the most 'unanswered'.
In your situation, I would probably merge the databases into one. I don't really see a reason to have them separated, and merging them will reduce the amount of work you have to do to support backups, etc. If you were importing data from a data source once and then never using the staging tables again, I could see a reason to bring up separate databases to handle the data transformation. Since you use these tables on an ongoing basis, I would much rather keep them together so that I only have to go to one place to find the full end-to-end state of the production data and the data-load states.
SQL Server 2008 is really good at handling database partitioning too. If the DB gets too large, or you need to separate data for security reasons, you get the benefit of a single DB along with the advantages of having several smaller ones. You won't get that with multiple smaller DBs.
When we migrated, we had a very similar situation, and I ended up moving everything into one somewhat large importing database, like you have hinted at. We did not, however, separate them using schemas.
Because the database is the unit of referential integrity and backup, if you are bringing in large amounts of data for staging which does not need to be backed up on the same schedule, it might be easiest to keep it in a separate DB.
You can use a single DB with multiple file groups and different backups, but it will require a lot more design.
The basic factors this will depend on are: recovery model, backup objectives, usage patterns and amount of effort to design and maintain your file group design.
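If you do go the single-database route, here is a rough sketch of the filegroup approach with assumed database, file, and path names (the staging filegroup gets its own, lighter backup schedule):

```sql
-- Assumed names; staging data lives on its own filegroup so it can be backed up separately.
ALTER DATABASE ImportDB ADD FILEGROUP Staging;
ALTER DATABASE ImportDB
    ADD FILE (NAME = 'ImportDB_Staging',
              FILENAME = 'D:\Data\ImportDB_Staging.ndf',
              SIZE = 1GB)
    TO FILEGROUP Staging;

-- Put the staging tables on that filegroup:
-- CREATE TABLE FeedAcme.CustomerStage (...) ON Staging;

-- Back up the primary filegroup nightly, staging on its own (less frequent) schedule:
BACKUP DATABASE ImportDB FILEGROUP = 'PRIMARY' TO DISK = 'E:\Backup\ImportDB_primary.bak';
BACKUP DATABASE ImportDB FILEGROUP = 'Staging' TO DISK = 'E:\Backup\ImportDB_staging.bak';
```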
All the prior answers work for me, particularly your comment about selectively combining databases -- if some are very busy, very large, or process sensitive data, you might want to keep them separate, or in separate groupings. This would make it easier to configure backups/restores and disk/drive allocation (give the busy ones their own set of spindles).
Like possibly most database developers, I have dealt almost exclusively with objects in the dbo schema, but I have done some recent work with other schemas. The main gotcha I've encountered is remembering to always specify the schema when referring to any database object. Never assume that any given connection will reference an object in the schema you want it to--always be clear and precise!
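A small sketch of what that looks like with one schema per feed (all names here are made up), and why schema-qualifying every reference matters:

```sql
-- Hypothetical feed schemas inside the single import database.
CREATE SCHEMA FeedAcme AUTHORIZATION dbo;
GO
CREATE SCHEMA FeedGlobex AUTHORIZATION dbo;
GO

CREATE TABLE FeedAcme.CustomerStage (
    CustomerID   INT           NOT NULL,
    CustomerName NVARCHAR(200) NOT NULL
);
GO

-- Always schema-qualify; an unqualified name resolves via the caller's default schema
-- and may silently hit the wrong feed's table (or fail).
SELECT COUNT(*) FROM FeedAcme.CustomerStage;   -- explicit and safe
-- SELECT COUNT(*) FROM CustomerStage;         -- depends on the connection's default schema
```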
I would put all your import staging tables in one database, separate from your regular production database, as the backup needs may be very different. This database should also contain things like your configuration management for SSIS packages, any logging tables, and any import metadata tables (we keep track of every run of the imports and the status of that run, as well as a bazillion other things about the import, like the filename, the normal file size, etc.). That comes in handy for researching problems and for adding checks to the processing. We use one schema per client and an additional schema for objects related to the importing/exporting process (logs, metadata, etc.).
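A sketch of such an import-metadata table; the schema, table, and column names are just an example of the kind of thing tracked per run:

```sql
-- Assumed names; one row per import run, used for troubleshooting and sanity checks.
CREATE TABLE import.RunLog (
    RunID       INT IDENTITY(1,1) PRIMARY KEY,
    FeedName    SYSNAME       NOT NULL,
    FileName    NVARCHAR(260) NOT NULL,
    FileSizeKB  BIGINT        NULL,
    RowsLoaded  INT           NULL,
    Status      VARCHAR(20)   NOT NULL DEFAULT 'Running',   -- Running / Succeeded / Failed
    StartedAt   DATETIME2     NOT NULL DEFAULT SYSDATETIME(),
    FinishedAt  DATETIME2     NULL
);
```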