Does CQRS With OLTP and OLAP Databases Make Sense? - api

I have several OLTP databases with API's talking to them. I also have ETL jobs pushing data to an OLAP database every few hours.
I've been tasked with building a custom dashboard showing hight level data from the OLAP database. I want to build several API's pointing to the OLAP database. Should I:
Add to my existing API's and call the OLAP database and use a CQRS type pattern, so reads come from OLAP, while writes come from OLTP. My concern here is that there could be a mismatch in the data between reads and writes. How mismatched the data is depends on how often you run the ETL jobs (Hours in my case).
Add to my existing API's and call the OLAP databases then ask the client to choose whether they want OLAP or OLTP data where API's overlap. My concern here is that the client should not need to know about the implementation detail of where the data is coming from.
Write new API's that only point to the OLAP database. This is a lot of extra work.

Don't use #1: when management talk of analyzed reports it don't bother data mismatch between ETL process - obviously you will generate a CEO report after finishing ETL for the day
Don't use #2: this way you'll load transnational system with analytic overhead and dissolve isolation between purpose of two systems (not good for operation and maintenance)
Use #3 as its the best way to fetch processed results, Use modern tools like Excel, PowerQuery, PowerBI to allow you to create rich dashboard with speed instead of going into tables and writing APIs.

Related

Creating a Datawarehouse

Currently our team is having a major database management/data management issue where hundreds of databases are being built and used for minor/one off applications where the app should really be pulling from an already existing database.
Since our security is so tight, the owners of these Systems of authority will not allow others to pull data from them at a consistent (App Necessary) rate, rather they allow a single app to do a weekly pull and that data is then given to the org.
I am being asked to compile all of those publicly available (weekly snapshots) into a single data warehouse for end users to go to. We realistically are talking 30-40 databases each with hundreds of thousands of records.
What is the best way to turn this into a data warehouse? Create a SQL server and treat each one as its own DB on the server? As far as the individual app connections I am less worried, I really want to know what is the best practice to house all of the data for consumption.
What you're describing is more of a simple data lake. If all you're being asked for is a single place for the existing data to live as-is, then sure, directly pulling all 30-40 databases to a new server will get that done. One thing to note is that if they're creating Database Snapshots, those wouldn't be helpful here. With actual database backups, it would be easy to build a process that would copy and restore those to your new server. This is assuming all of the sources are on SQL Server.
"Data warehouse" implies a certain level of organization beyond that, to facilitate reporting on an aggregate of the data across the multiple sources. Generally you'd identify any concepts that are shared between the databases and create a unified table for each concept, then create an ETL (extract, transform, load) process to standardize the data from each source and move it into those unified tables. This would be a large lift for one person to build. There's plenty of resources that you could read to get you started--Ralph Kimball's The Data Warehouse Toolkit is a comprehensive guide.
In either case, a tool you might want to look into is SSIS. It's good for copying data across servers and has drivers for multiple different RDBMS platforms. You can schedule SSIS packages from SQL Agent. It has other features that could help for data warehousing as well.

Staging tables in DB vs storage area

Typically on an on-premise SQL server ETL workflow via SSIS, we load data from anywhere into staging tables and then apply validation and transformations to load/merge them into downstream data warehouse tables.
My question is if we should do something similar on Azure, where we have set of staging tables and downstream tables in azure SQL database or use azure storage area as staging and move data from there into final downstream tables via ADF.
As wild is it may seem, we also have a proposal to have separate staging database and downstream database, between which we move using ADF.
There are different models for doing data movement pipelines and no single one is perfect. I'll make a few comments on the common patterns I see in case that will help you make decisions on your application.
For many data warehouses where you are trying to stage in data and create dimensions, there is often a process where you load the raw source data into some other database/tables as raw data and then process it into the format you want to insert into your fact and dimension tables. That process is complicated by the fact that you may have data arrive late or data that is corrected on a later day, so often these systems are designed using partitioned tables on the target fact tables to allow re-processing of a partition worth of data (e.g. a day) without having to reprocess the whole fact table. Furthermore, the transformation process on that staging table may be intensive if the data itself is coming in a form far away from how you want to represent it in your DW. Often in on-premises systems, these are handled in a separate database (potentially on the same SQL Server) to isolate it from the production system. Furthermore, it is sometimes the case that these staging tables are re-creatable from original source data (CSV files or similar), so it is not the store of record for that source material. This allows you to consider using simple recovery mode on that database (which reduces the Log IO requirements and recovery time compared to full recovery). While not every DW uses full recovery mode for the processed DW data (some do dual load to a second machine instead since the pipeline is there), the ability to use full recovery plus physical log replication (AlwaysOn Availability Groups) in SQL Server gives you the flexibility to create a disaster recovery copy of the database in a different region of the world. (You can also do query read scale-out on that server if you would like). There are variations on this basic model, but a lot of on-premises systems have something like this.
When you look at SQL Azure, there are some similarities and some differences that matter when considering how to set up an equivalent model:
You have full recovery on all user databases (but tempdb is in simple recovery). You also have quorum-commit of your changes to N replicas (like in Availability Groups) when using v-core or premium dbs which matters a fair amount because you often have a more generic network topology in public cloud systems vs. a custom system you build yourself. In other words, log commit times may be slower than your current system. For batch systems it does not necessarily matter too much, but you need to be careful to use large enough batch sizes so that you are not waiting on the network all the time in your application. Given that your staging table may also be a SQL Azure database, you need to be aware that it also has quorum commit so you may want to consider which data is going to stay around day-over-day (stays in SQL Azure DB) vs. which can go into tempdb for lower latencies and be re-created if lost.
There is no intra-db resource governance model today in SQL Azure (other than elastic pools which is partial and is targeting a different use case than DW). So, having a separate staging database is a good idea since it isolates your production workload from the processing in the staging database. You avoid noisy neighbor issues with your primary production workload being impacted by the processing of the day's data you want to load.
When you provision machines for on-premises DW, you often buy a sufficiently large storage array/SAN that you can host your workload and potentially many others (consolidation scenarios). The premium/v-core DBs in SQL Azure are set up with local SSDs (with Hyperscale being the new addition where it gives you some cross-machine scale-out model that is a bit like a SAN in some regards). So, you would want to think through the IOPS required for your production system and your staging/loading process. You have the ability to choose to scale up/down each of these to better manage your workload and costs (unlike a CAPEX purchase of a large storage array which is made up front and then you tune workloads to fit into it).
Finally, there is also a SQL DW offering that works a bit differently than SQL Azure - it is optimized for larger DW workloads and has scale-out compute with the ability to scale that up/down as well. Depending on your workload needs, you may want to consider that as your eventual DW target if that is a better fit.
To get to your original question - can you run a data load pipeline on SQL Azure? Yes you can. There are a few caveats compared to your existing experiences on-premises, but it will work. To be fair, there are also people who just load from CSV files or similar directly without using a staging table. Often they don't do as many transformations, so YMMV based on your needs.
Hope that helps.

wso2cep : Data Storage in addition to display

I was wondering if in addition to process and display data on dashboard in wso2cep, can I store it somewhere for a long period of time to get further information later? I have studied there are two types of tables used in wso2cep, in-memory and rdbms tables.
Which one should I choose?
There is one more option that is to switch to wso2das. Is it a good approach?
Is default database is fine for that purpose or I should move towards other supported databases like sql, orcale etc?
In-memory or RDBMS?
In-memory tables will internally use java collections structures, so it'll get destroyed once the JVM is terminated (after server restart, data won't be available). On the other hand, RDBMS tables will persist data permanently. For your scenario, I think you should proceed with RDBMS tables.
CEP or DAS?
CEP will only provide real-time analytics, where DAS provides batch analytics (with Spark SQL) in addition to real-time analytics. If you have a scenario which require batch processing, incremental processing, etc ... You can go ahead with DAS. Note that, migration form CEP to DAS is quite simple (since the artifacts are identical).
Default (H2) DB or other DB?
By default WSO2 products use embedded H2 DB as data source. However, it's recommended to use MySQL or Oracle in production environments.

Do SQL targeted BI solutions like Looker and Chart.io use OLAPs?

I know that OLAP is used in Power Pivot, as far as I know, to speed up interacting with data.
But I know that big data databases like Google BigQuery and Amazon RedShift have appeared in the last few years. Do SQL targeted BI solutions like Looker and Chart.io use OLAPs or do they rely on the speed of the databases?
Looker relies on the speed of the database but does model the data to help with speed. Mode and Periscope are similar to this. Not sure about Chartio.
OLAP was used to organize data to help with query speeds. While used by many BI products like Power Pivot and Pentaho, several companies have built their own ways of organizing data to help with query speed. Sometimes this includes storing data in their own data structures to organize the data. Many cloud BI companies like Birst, Domo and Gooddata do this.
Looker created a modeling language called LookML to model data stored in a data store. As databases are now faster than they were when OLAP was created, Looker took the approach of connecting directly to the data store (Redshift, BigQuery, Snowflake, MySQL, etc) to query the data. The LookML model allows the user to interface with the data and then run the query to get results in a table or visualization.
That depends. I have some experience with BI solution (for example, we worked with Tableau), and it can operate is two main modes: It can execute the query against your server, or can collect the relevant data and store it on the user's machine (or on the server where the app installed). When working with large volumes, we used to make Tableau query the SQL Server itself, that's because our SQL Server machine is very strong compared to the other machines we had.
In any way, even if you store the data locally and want to "refresh" it, when it updates the data it needs to retrieve it from the database, which sometimes can also be an expensive operation (depends on how your data is built and organized).
You should also notice that you compare 2 different families of products: while Google BigQuery and Amazon's RedShift are actually database engines that used to store the data and also query it, most of the BI and reporting solutions are more concerend about querying the data and visualizing it and therefore (generally speaking) are less focused on having smart internal databases (at least from my experience).

SSAS - MSBI - Solution - Suggestions

Is it correct in my understanding that we can build SSAS cubes sourcing from the transaction Systems? I meant the not the live but copy of the Live.
I'm trying to see if there is any scope to address few reporting needs without the need to build a traditional Data Warehouse and then build cubes on top of the data warehouse, instead build cubes to do Financial monthly aggregated reporting needs sourcing from backup copy of the Transaction systems.
Alternatively, if you have any better way to proceed please suggest.
Regards,
KK
You can create a set of views on top of you transactional system tables and then build your SSAS cubes ontop of those views. This would be less effort than creating a fully fledged datawarehouse.
I am a data warehouse developer (and therefore believe in cubes), but not every reporting solution warrants the cost of building a cube. If your short to medium term reporting requirements are fixed and you don't have users requiring data to be sliced differently each week, then a series of fixed reports may suffice.
You can create a series of SQL Server Reporting Services reports (or extract to Excel) either directly against your copied transactional data, or against a series of summarised tables that are created periodically. If you decide to utilise a series of pre-formatted reporting tables, try to create tables that cover multiple similar reports (rather than 1 monthly report table = 1 report) for ease of ongoing maintenance.
There are many other important aspects to this that you may need to consider first. Like how busy is the transaction system, what is the size of the data, concurrency and availability issues etc.
It is absolutely fine to have a copy of your live data and then build a report on the top of it. Bear in mind that the data you see in the report will not be the latest and there will be a latency factor depending on the frequency of your data pull.