This is fairly unusual, but because of office politics etc. we have (read-only) access to the data warehouse, but not to the live data. However, we need up-to-date data to populate our relational (OLTP) database (MS SQL Server). The data in the warehouse (also MS) is in star schema format (i.e. dimensions and facts). I am not very familiar with warehouse DBs. How can I get data from the warehouse into a relational database? My google-fu was too weak to get me any answers (lots for the other way round).
Thanks
Chavoux
If there is a data warehouse, then an ETL process is already in place. So use the same tool (SSIS?) that loads the DW to extract data from the DW and move it to a different DB. You can probably ask your ETL guy to help too :).
SSIS packages? Anything that can load and transform data.
Sounds like the data warehouse is a SQL Server database so standard SQL can be used. You can use SSIS to transform the data from the DW and load it to the OLTP database.
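To make the "standard SQL" point concrete, here is a minimal sketch of turning a star schema back into relational tables: dimensions become parent tables, and fact rows become child rows whose surrogate keys are swapped for business keys through a join. In-memory SQLite stands in for both SQL Server databases, and every table and column name (DimCustomer, FactSales, Customer, SalesOrder) is hypothetical, not taken from the question.

```python
import sqlite3

# In-memory SQLite stands in for both SQL Server databases (an assumption, so
# the sketch runs anywhere). All table and column names are hypothetical.
SETUP = """
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, CustomerCode TEXT, Name TEXT);
CREATE TABLE FactSales (CustomerKey INTEGER, OrderNumber TEXT, Amount REAL);
INSERT INTO DimCustomer VALUES (1, 'C001', 'Acme'), (2, 'C002', 'Globex');
INSERT INTO FactSales VALUES (1, 'SO-1', 100.0), (1, 'SO-2', 50.0), (2, 'SO-3', 75.0);

CREATE TABLE Customer (CustomerCode TEXT PRIMARY KEY, Name TEXT);
CREATE TABLE SalesOrder (
    OrderNumber TEXT PRIMARY KEY,
    CustomerCode TEXT REFERENCES Customer (CustomerCode),
    Amount REAL
);
"""


def load_oltp_from_star(conn):
    """Copy dimension rows to the parent table, then fact rows to the child table."""
    with conn:  # one transaction: both inserts succeed or neither does
        conn.execute(
            "INSERT INTO Customer (CustomerCode, Name) "
            "SELECT CustomerCode, Name FROM DimCustomer"
        )
        # The join replaces the warehouse surrogate key with the business key.
        conn.execute(
            "INSERT INTO SalesOrder (OrderNumber, CustomerCode, Amount) "
            "SELECT f.OrderNumber, d.CustomerCode, f.Amount "
            "FROM FactSales AS f "
            "JOIN DimCustomer AS d ON d.CustomerKey = f.CustomerKey"
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    load_oltp_from_star(conn)
    return conn.execute(
        "SELECT OrderNumber, CustomerCode, Amount FROM SalesOrder ORDER BY OrderNumber"
    ).fetchall()
```

The same two INSERT ... SELECT statements could be the body of an SSIS data flow or a scheduled stored procedure; the join logic is the part that carries over.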
How can I replicate (incrementally load) MongoDB (NoSQL) data to SQL tables?
We have a web-based solution that loads data into MongoDB. The data size is almost 1 TB. We need to do BI reporting in the Looker BI tool, but Looker doesn't support MongoDB directly. So we have to replicate our data into SQL form; we have Redshift as the target database.
Main requirements for parsing NoSQL to SQL:
Parent Node should be the main table
Nested node/arrays should be a separate table with parent key (foreign key)
Whenever a new column is introduced in MongoDB source it should automatically start replicating that new field from any document to the target database.
Incremental refresh from source to target.
I've seen Stitch Data ETL, which fits my requirements, but I'm looking for an open-source ETL/DB tool or library.
Please help.
I'm posting an answer to help out others with the same requirements.
I was not able to find any open-source ETL tool that fulfills all four requirements above.
I am trying to write Python code to do so. But a paid tool named Precog helped me fulfill all the above requirements, and it is a little cheaper than Stitch Data ETL.
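For anyone going the Python route, here is a rough sketch of the first three requirements only: the parent document becomes a row in the main table, nested arrays become rows in a child table carrying a foreign key to the parent, and the column list is derived from whatever keys appear, so new fields are picked up automatically. The function names, the `<table>_<field>` child-table naming, and the generated child `_id` format are my own assumptions; incremental refresh and the actual Redshift load are not covered.

```python
def flatten(doc, table, tables, parent=None):
    """Add `doc` as a row of `table`; send nested arrays to child tables.

    `tables` maps table name -> list of row dicts and is filled in place.
    `parent` is an optional (parent_table, parent_id) pair for the foreign key.
    """
    row = {}
    if parent is not None:
        parent_table, parent_id = parent
        row[f"{parent_table}_id"] = parent_id  # foreign key to the parent row
    arrays = []
    for key, value in doc.items():
        if isinstance(value, list):
            arrays.append((key, value))
        elif isinstance(value, dict):
            # Embedded object: inline as prefixed columns (one level only).
            for sub_key, sub_value in value.items():
                row[f"{key}_{sub_key}"] = sub_value
        else:
            row[key] = value
    tables.setdefault(table, []).append(row)

    row_id = doc.get("_id")
    for key, items in arrays:
        for index, item in enumerate(items):
            child = item if isinstance(item, dict) else {"value": item}
            # Generated key, used only when the element has no _id of its own.
            child = {"_id": f"{row_id}.{key}.{index}", **child}
            flatten(child, f"{table}_{key}", tables, parent=(table, row_id))
    return tables


def columns_of(rows):
    """Union of keys over all rows, so newly introduced fields become columns."""
    return sorted({key for row in rows for key in row})
```

For example, flattening `{"_id": 1, "name": "A", "items": [{"sku": "s1"}]}` into a table named `orders` yields one `orders` row and one `orders_items` row with `orders_id` set to 1.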
Thanks
We use data in two ways: retrieving survey data from other organizations, or creating survey instruments ourselves and soliciting data from organizations under ours.
We have a database where our largest table is perhaps 10 million records. We extract and upload most of our data on an annual basis, occasionally needing to ETL a large number of tables from organizations such as the Census, American Community Survey, etc. Our database is all on Azure, and currently the way that I get data from Census flat files/.csv files is by re-saving them as Excel and using the Excel import wizard.
All of the 'T' in ETL is happening within programmed procedures within my staging database before moving those tables (using Visual Studio) to our reporting database.
Is there a more sophisticated technology I should be using, and if so, what is it? All of my education in this matter comes from perusing Google and watching YouTube, so my grasp on all of the different terminology is lacking and searching on the internet for ETL is making it difficult to get to what I believe should be a simple answer.
For a while I thought we wanted to eventually graduate to using SSIS, but I learned that SSIS was something that was used primarily if you had a database on-premises. I've tried looking at dynamic SQL using BULK INSERT, only to find that BULK INSERT doesn't work with Azure DBs. Etc.
Recently I've been learning about Azure Data Factory and something called the Bulk Copy Program (bcp), run from Windows PowerShell.
Does anybody have any suggestions as to what technology I should look at for a small-scale BI reporting solution?
I suggest using Data Factory; it performs well for large data transfers.
Reference: Copy performance and scalability achievable using ADF
The Copy activity lets you use a table, a query, or a stored procedure to filter data in the source.
The sink lets you select a destination table, a stored procedure, or an auto-created table (bulk insert) to receive the data.
Data Factory Mapping Data Flow provides more features for data transformation.
Ref: Copy and transform data in Azure SQL Database by using Azure Data Factory.
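If you want to see what the copy step amounts to without the Excel detour, here is a small scripted sketch of a source-to-sink copy: it reads a CSV file and inserts it into a staging table in batches. This is not Data Factory itself. SQLite stands in for the Azure SQL staging database, and the function name, the batch size, and the assumption that the header row contains plain column names are all mine.

```python
import csv
import sqlite3
from itertools import islice


def load_csv(conn, table, csv_file, batch_size=1000):
    """Copy a CSV file object into `table`, creating it from the header row.

    Assumes the header holds plain column names (no double quotes in them).
    Returns the number of data rows loaded.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    total = 0
    with conn:  # a failed load rolls back the inserted rows
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        while True:
            batch = list(islice(reader, batch_size))  # next chunk of rows
            if not batch:
                break
            conn.executemany(
                f'INSERT INTO "{table}" ({columns}) VALUES ({marks})', batch
            )
            total += len(batch)
    return total
```

Every value arrives as text, so codes such as `01001` keep their leading zeros; casting to numeric types can then happen in the staging procedures that already do the 'T'.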
Hope this helps.
I need to export a multi-terabyte dataset processed via Azure Data Lake Analytics (ADLA) into a SQL Server database.
Based on my research so far, I know that I can write the ADLA output to a Data Lake store or WASB using built-in outputters, and then read the output data from SQL Server using Polybase.
However, creating the result of ADLA processing as an ADLA table seems pretty enticing to us. It is a clean solution (no files to manage), multiple readers, built-in partitioning, distribution keys and the potential for allowing other processes to access the tables.
If we use ADLA tables, can I access ADLA tables via SQL Polybase? If not, is there any way to access the files underlying the ADLA tables directly from Polybase?
I know that I can probably do this using ADF, but at this point I want to avoid ADF to the extent possible - to minimize costs, and to keep the process simple.
Unfortunately, Polybase support for ADLA tables is still on the roadmap and not yet available. Please file a feature request through the SQL Data Warehouse UserVoice page.
The suggested workaround is to produce the information as CSV in ADLA, then create the partitioned and distributed table in SQL DW and use Polybase to read the data and fill the SQL DW managed table.
The company I am working for is implementing SharePoint with reporting servers that run on a SQL back end. The information that we need lives on two different servers. The first is the manufacturing server, which collects data from PLCs and writes it into a SQL database; the other is our ERP server, which has data for payroll and hours worked on specific projects. The idea I have is to create a view on a separate database, from which I can pull the information from both servers. I am having a little bit of trouble with the syntax for connecting the two servers to run the view. We are running MS SQL. If you need any more information or clarification please let me know.
Please read this about Linked Servers.
Alternatively you can make a data warehouse, which would be a reporting database. You can feed it either by writing procs that use linked servers, or by using SSIS packages if the servers are not linked.
It all depends on the project's size and complexity, but in many cases it is difficult to aggregate data from multiple sources with views. The reason is that the source data structure is modeled for the source application and not optimized for reporting.
In that case, I would suggest going with an ETL process, where you would create a set of Extract, Transform and Load jobs to get data from multiple sources (databases) into a target database where data will be stored in the format optimized for reporting.
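As a rough illustration of what such a set of jobs looks like, here is a minimal extract/transform/load sketch for the two-server case in the question. Three in-memory SQLite databases stand in for the manufacturing server, the ERP server, and the reporting database; the table names (Production, Timesheet, ProjectSummary) and their columns are hypothetical.

```python
import sqlite3


def extract(conn, query):
    """Extract: pull rows from one source database."""
    return conn.execute(query).fetchall()


def transform(production_rows, labor_rows):
    """Transform: total units and hours per project across both sources."""
    units, hours = {}, {}
    for project, quantity in production_rows:
        units[project] = units.get(project, 0) + quantity
    for project, worked in labor_rows:
        hours[project] = hours.get(project, 0.0) + worked
    projects = sorted(set(units) | set(hours))
    return [(p, units.get(p, 0), hours.get(p, 0.0)) for p in projects]


def load(conn, rows):
    """Load: replace the reporting table contents in one transaction."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ProjectSummary "
            "(Project TEXT PRIMARY KEY, Units INTEGER, Hours REAL)"
        )
        conn.execute("DELETE FROM ProjectSummary")
        conn.executemany("INSERT INTO ProjectSummary VALUES (?, ?, ?)", rows)


def run_demo():
    manufacturing = sqlite3.connect(":memory:")
    erp = sqlite3.connect(":memory:")
    reporting = sqlite3.connect(":memory:")
    manufacturing.executescript(
        "CREATE TABLE Production (Project TEXT, Quantity INTEGER);"
        "INSERT INTO Production VALUES ('P1', 10), ('P1', 5), ('P2', 7);"
    )
    erp.executescript(
        "CREATE TABLE Timesheet (Project TEXT, Hours REAL);"
        "INSERT INTO Timesheet VALUES ('P1', 6.0), ('P1', 2.0), ('P2', 4.5), ('P3', 2.0);"
    )
    rows = transform(
        extract(manufacturing, "SELECT Project, Quantity FROM Production"),
        extract(erp, "SELECT Project, Hours FROM Timesheet"),
    )
    load(reporting, rows)
    return extract(
        reporting, "SELECT Project, Units, Hours FROM ProjectSummary ORDER BY Project"
    )
```

Reports then query ProjectSummary alone, so neither source server is touched at report time.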
Ralph Kimball has many great books on the subject, for example:
1) The Data Warehouse ETL Toolkit
2) The Data Warehouse Toolkit
They are truly worth the read if you are dealing with data
I am developing an application whose database is very generic, so I really can't use it for reporting. So I need a solution for reporting. I'm a developer, so my knowledge in the DBA domain is limited. For now my idea is to create another database where I'll put denormalized data from the original DB. I saw that I could use SSIS for that, and would be glad if someone could give me some advice on how to attack the problem. Should I sync data once a day and run reports that way? Is there a solution to sync data continuously so reports would always be up to date? Any advice, please. Thanks!
Damir,
What I get from your message is that you are getting close to building a data warehouse using a star schema pattern.
You could have two databases, one with normalized data and the other with the star schema pattern (your DW), and then create a script that takes your normalized data and puts it into your data warehouse. The frequency of the script is up to you: after each transaction, every hour, once a day, etc.
The advantage of having a data warehouse is that you will be able to use OLAP cubes and the MDX language for your reports. It's a plus!
Hope this helps,
If you are on SQL Server 2008 or later, explore the MERGE statement.
For smaller tables, just truncate and reload. 'Smaller' is subjective, but if a table takes less than 2-3 minutes to load, it could be termed small. Obviously, during that period any query that uses such tables would fail.
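Both approaches can be sketched side by side. SQLite has neither MERGE nor TRUNCATE, so this is a stand-in and not T-SQL: the merge is emulated with an UPDATE followed by an INSERT of the missing keys, and the reload uses DELETE. The `report` table and its `id` and `amount` columns are hypothetical.

```python
import sqlite3


def merge_rows(conn, rows):
    """Emulate a merge: update rows whose id exists, insert the rest.

    `rows` is a list of (id, amount) pairs.
    """
    with conn:  # one transaction, so a failure leaves the table unchanged
        conn.executemany(
            "UPDATE report SET amount = ? WHERE id = ?",
            [(amount, row_id) for row_id, amount in rows],
        )
        conn.executemany(
            "INSERT INTO report (id, amount) "
            "SELECT ?, ? WHERE NOT EXISTS (SELECT 1 FROM report WHERE id = ?)",
            [(row_id, amount, row_id) for row_id, amount in rows],
        )


def reload_all(conn, rows):
    """Truncate-and-reload stand-in: empty the table, then insert everything."""
    with conn:
        conn.execute("DELETE FROM report")
        conn.executemany("INSERT INTO report (id, amount) VALUES (?, ?)", rows)
```

The merge touches only the rows that changed or are new, which is why it suits larger tables; the reload is simpler but rewrites everything on each run.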