I have few questions on ETL which I am not crystal clear which are as follows:
1)What are the components of test plan vs test strategy in detail?
2)What is the difference between ODS vs staging area in ETL? Are both present in ETL between source and target database (data warehouse) or only one present? If both are present which comes first?
ODS : Operational Data store.
It is a collection of integrated databases designed to support operational monitoring. Unlike the OLTP databases, the data in the ODS are integrated, subject oriented and enterprise wide data
Staging Area: Which comes after ODS, where it cleans operational data before loading the
data into Data warehouse.
Related
I have N databases, for example 10 databases.
Every database has the same schema, but different data.
Now i would like to take every data of each database from the table "Table1" and insert them in a common table in a new database "DWHDatabase" in a table named Table1Common.
so it's an insert like n to 1.
How i can do that? i'm trying to solve my issues with the elastic queries but seems it's a 1 to 1 stuff
Use Azure Data Factory with Linked Services to each database. Use the Copy activity to load the data.
You can also paramaterize the solution.
Parameterize linked services
Parameters in Azure Data Factory by Catherine Wilhemsen
Elastic query is best suited for reporting scenarios in which the majority of the processing (filtering, aggregation) may be done on the external source side. It is unsuitable for ETL procedures involving significant amounts of data transfer from a distant database (s). Consider Azure Synapse Analytics for large reporting workloads or data warehousing applications with more sophisticated queries.
You may use the Copy activity to copy data across on-premises and
cloud-based data storage. After you've copied the data, you may use
other actions to alter and analyse it. The Copy activity may also be
used to publish transformation and analysis findings for use in
business intelligence (BI) and application consumption.
MSFT Copy Activity Overview: Here.
I have several OLTP databases with API's talking to them. I also have ETL jobs pushing data to an OLAP database every few hours.
I've been tasked with building a custom dashboard showing hight level data from the OLAP database. I want to build several API's pointing to the OLAP database. Should I:
Add to my existing API's and call the OLAP database and use a CQRS type pattern, so reads come from OLAP, while writes come from OLTP. My concern here is that there could be a mismatch in the data between reads and writes. How mismatched the data is depends on how often you run the ETL jobs (Hours in my case).
Add to my existing API's and call the OLAP databases then ask the client to choose whether they want OLAP or OLTP data where API's overlap. My concern here is that the client should not need to know about the implementation detail of where the data is coming from.
Write new API's that only point to the OLAP database. This is a lot of extra work.
Don't use #1: when management talk of analyzed reports it don't bother data mismatch between ETL process - obviously you will generate a CEO report after finishing ETL for the day
Don't use #2: this way you'll load transnational system with analytic overhead and dissolve isolation between purpose of two systems (not good for operation and maintenance)
Use #3 as its the best way to fetch processed results, Use modern tools like Excel, PowerQuery, PowerBI to allow you to create rich dashboard with speed instead of going into tables and writing APIs.
We have shifted from IBM DB2 databases to having PostGRE SQL databases on the AWS Cloud. Is anyone aware of or has worked with AWS to test databases?
a) If so, what tools do you use?
b) What do you test when checking the databases in a Business Intelligence (BI) type of environment?
Anything other than just load or performance testing on it. I wish to check on Functional Testing, where I validation/verify that the data on the Cloud Servers and Databases is equivalent to the Data in the physical Servers with DB2 as the database.
So, mainly a kind of data reconciliation, but with ETL also involved.
Our product Ajilius (http://ajilius.com) does 90% of what you're after. We specialise in cloud data warehouse automation. PostgreSQL is our primary DBMS for on-premise and SMP data warehouses; Redshift is one of our cloud platforms (as well as Snowflake and Azure SQL Data Warehouse); and DB2 is a supported data source.
I say "90%" because our data warehouse migration feature reconciles data that is migrated between warehouses, but only when both warehouses were created by Ajilius. I'd like to understand more about your need, if you email me through our web site we can talk it over in detail.
Two competitors - Matillion and Treasure Data - also work in this space. Matillion is a full ETL tool, Treasure Data is more "EL" without the T. Definitely look at them, they're both good products with different approaches.
Is it correct in my understanding that we can build SSAS cubes sourcing from the transaction Systems? I meant the not the live but copy of the Live.
I'm trying to see if there is any scope to address few reporting needs without the need to build a traditional Data Warehouse and then build cubes on top of the data warehouse, instead build cubes to do Financial monthly aggregated reporting needs sourcing from backup copy of the Transaction systems.
Alternatively, if you have any better way to proceed please suggest.
Regards,
KK
You can create a set of views on top of you transactional system tables and then build your SSAS cubes ontop of those views. This would be less effort than creating a fully fledged datawarehouse.
I am a data warehouse developer (and therefore believe in cubes), but not every reporting solution warrants the cost of building a cube. If your short to medium term reporting requirements are fixed and you don't have users requiring data to be sliced differently each week, then a series of fixed reports may suffice.
You can create a series of SQL Server Reporting Services reports (or extract to Excel) either directly against your copied transactional data, or against a series of summarised tables that are created periodically. If you decide to utilise a series of pre-formatted reporting tables, try to create tables that cover multiple similar reports (rather than 1 monthly report table = 1 report) for ease of ongoing maintenance.
There are many other important aspects to this that you may need to consider first. Like how busy is the transaction system, what is the size of the data, concurrency and availability issues etc.
It is absolutely fine to have a copy of your live data and then build a report on the top of it. Bear in mind that the data you see in the report will not be the latest and there will be a latency factor depending on the frequency of your data pull.
The company i am working for is implementing Share-point with reporting servers that runs on an SQL back end. The information that we need lives on two different servers. The first server being the Manufacturing server that collects data from PLCs and inputs that information into a SQL database, the other server is our erp server which has data for payroll and hours worked on specific projects. The i have is to create a view on a separate database and then from there i can pull the information from both servers. I am having a little bit of trouble with the syntax for connecting the two servers to run the View. We are running ms SQL. If you need any more information or clarification please let me know.
Please read this about Linked Servers.
Alternatively you can make a Data Warehouse - which would be a reporting data base. You can feed this by either making procs with linked servers or use SSIS packages if they're not linked.
It all depends on a project size and complexity, but in many cases it is difficult to aggregate data from multiple sources with Views. The reason is that the source data structure is modeled for the source application and not optimized for reporting.
In that case, I would suggest going with an ETL process, where you would create a set of Extract, Transform and Load jobs to get data from multiple sources (databases) into a target database where data will be stored in the format optimized for reporting.
Ralph Kimball has many great books on the subject, for example:
1) The Data Warehouse ETL Toolkit
2) The Data Warehouse Toolkit
They are truly worth the read if you are dealing with data