We are looking to build a solution on GCP for campaign/ad analytics (ingesting DoubleClick and other ad-server data into a data warehouse). Data is ingested in batches into a star schema, but updates will keep trickling in for up to a week, and we need trend analysis and reporting across multiple clients (advertisers). I can't decide between Google Bigtable, which supports updates and time-series analysis, and BigQuery, which is ideal for star schemas and ad-hoc analysis.
Any suggestions? Performance and flexibility are important.
You may find the following solution guide helpful for learning how to build an analysis pipeline using BigQuery and other GCP products and tools:
https://cloud.google.com/solutions/marketing-data-warehouse-on-gcp
Bigtable, meanwhile, is a good fit for building real-time bidding and other pieces of core ad-serving infrastructure. See e.g.:
https://cloud.google.com/customers/mainad/
https://cloud.google.com/solutions/infrastructure-options-for-building-advertising-platforms
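If you go the BigQuery route, the week of late-arriving updates is straightforward to handle by landing each batch in a staging table and MERGE-ing it into the (ideally date-partitioned) fact table. A minimal sketch with the google-cloud-bigquery client; the project, dataset, table, and column names (ads.fact_impressions, ads.stg_impressions, impression_id, ...) are hypothetical placeholders:

```python
# Sketch: upsert late-arriving ad-server rows into a BigQuery fact table.
# Project, dataset, table, and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # assumed project id

merge_sql = """
MERGE `my-gcp-project.ads.fact_impressions` AS f
USING `my-gcp-project.ads.stg_impressions` AS s
ON f.impression_id = s.impression_id
WHEN MATCHED THEN
  UPDATE SET f.clicks = s.clicks, f.cost = s.cost
WHEN NOT MATCHED THEN
  INSERT (impression_id, advertiser_id, event_date, clicks, cost)
  VALUES (s.impression_id, s.advertiser_id, s.event_date, s.clicks, s.cost)
"""

# Run the MERGE as a standard query job and wait for it to finish.
job = client.query(merge_sql)
job.result()
print(f"Rows affected: {job.num_dml_affected_rows}")
```

Partitioning the fact table on the event date (and clustering on the advertiser id) keeps the per-client trend queries cheap, since most reports only touch a bounded date range.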
Could someone please tell me why some people do this: after creating a data warehouse, they build both reports (reporting) and OLAP analysis.
My question is why we would do OLAP analysis and also create reports. What is the benefit of doing both of them? I think reporting is sufficient to help the client analyse the data, but still some clients ask for both.
I use Analysis Services models as the source for all reporting. In your case you may have transactional reporting (large amounts of row-level data) which doesn't lend itself to that technology; Analysis Services is better suited to data which is likely to be aggregated.
Tabular models are a great way to present data to users for them to interact with, as they can be designed in a way which makes them better suited to self-service data analytics.
I've also implemented the hybrid approach you mentioned. This can be useful if businesses have varying report requirements. For example, dashboarding could be done with Power BI connected to the tabular model, whereas transactional reporting such as large emailed spreadsheets could be run from the SQL Server (perhaps using SSRS or Power BI paginated reports).
I'm confused and having trouble finding examples and reference architectures where someone wants to extract data from an existing data lake (S3/Lake Formation in my case) and build an OLTP datastore that serves as an application backend. Everything I come across is an OLAP data-warehousing pattern (i.e. ETL -> S3 -> Redshift -> BI tools) where data is always coming IN to the data lake and warehouse rather than being pulled OUT. I don't necessarily need 'business analytics', but I do need to display graphs in web dashboards, backed by large amounts of time-series data points, for my website's users.
What if I want to automate pulling extracts of a large dataset in the data lake and build a relational database with some useful extracts from the various datasets, to be queried by the handful instead of running large analytical queries against a DW?
What if I just want an extract of, say, stock prices over 10 years, and just need the list of unique ticker symbols for populating a drop-down on a web app? I don't want to query an OLAP data warehouse every time to get this, so I want my own OLTP store for more performant queries on smaller datasets at much higher TPS.
What if I want to build dashboards for my web app's customers that display graphs over large amounts of time-series data currently sitting in the data lake/warehouse? Does my web app connect directly to the DW to display this data, or do I pull that data out of the data lake or warehouse and into my application DB on some schedule?
My views on your 3 questions:
1. Why not just use the same ETL solution that is being used to load the data lake?
2. Presumably your DW has a Ticker dimension with a unique record for each ticker symbol? What's the issue with querying that? Getting the unique ticker symbols from it should be very fast.
3. It depends entirely on your environment/infrastructure and what you are doing with the data, so there is no generic answer anyone can give you. If your web app is showing aggregations over a large volume of data, then your DW is probably better at doing those aggregations and passing the aggregated results to your web app; if the web app is showing unaggregated data (and only a small subset of what is held in your DW, such as just the last week's data), then loading it into your application DB might make more sense.
The pros/cons of any solution will also be heavily influenced by your infrastructure, e.g. what the network performance is like between your DW and your application.
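If you do go down the "pull it out on a schedule" route for something like the ticker-symbol drop-down, a minimal sketch (assuming the lake is queryable through Athena and the application DB is PostgreSQL; the database, table, bucket, and column names here are all hypothetical) could look like this, run from a cron job or scheduled Lambda:

```python
# Sketch: pull a small extract out of the data lake (via Athena) and load it
# into the OLTP application database. All names and credentials are placeholders.
import time
import boto3
import psycopg2

athena = boto3.client("athena", region_name="us-east-1")

# 1) Ask Athena (over the Glue/Lake Formation catalog) for the distinct tickers.
qid = athena.start_query_execution(
    QueryString="SELECT DISTINCT ticker FROM prices",
    QueryExecutionContext={"Database": "stocks"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# 2) Poll until the query finishes (simplified; no error handling or paging).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"][1:]  # skip header row
tickers = [r["Data"][0]["VarCharValue"] for r in rows]

# 3) Upsert the small extract into the application's PostgreSQL database.
conn = psycopg2.connect("dbname=appdb user=app password=secret host=app-db")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS tickers (symbol text PRIMARY KEY)")
    cur.executemany(
        "INSERT INTO tickers (symbol) VALUES (%s) ON CONFLICT DO NOTHING",
        [(t,) for t in tickers],
    )
```

The same pattern (the lake/warehouse does the heavy aggregation, the app DB stores the small result) works for the dashboard time series, with the refresh schedule matching how fresh the charts need to be.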
My company wants to speed up the process of delivering reports. Internally, we have a team of 12 people working on building reports. The company is large, with over 10,000 employees. We're asked to work on ad-hoc reports quite frequently, but it takes us on average 1-2 weeks to deliver them, and senior execs have said that the time to deliver is too slow. An external consulting firm came in to do some discovery work, and they have advised that business users should have access to the Azure Data Warehouse so that they can build models directly in Azure Analysis Services and Power BI.
The design that they have suggested is as follows:
Load data from SAP, into the Azure Data Warehouse directly.
Build our data models in the Azure DW - this means all the transformation work is done directly in Azure DW (Staging, Cleansing, Star Schema build).
Build the models in Azure Analysis Services.
Consume in Power BI.
Does this seem like a good strategy? I am new to Azure Data Warehouse, and our technical lead is on paternity leave, so we are unable to ask for his help.
I asked the external consultant what the impact would be of running all the transformation workloads directly in Azure DW, and he said that 'it's MPP, so processing is super-fast'.
Can anyone help?
Azure is certainly a great platform for modern data warehousing and analytics, but whether or not to use Azure SQL DW requires more study. Generally speaking, you can consider two options:
Volume is not huge ( < 10TB ):
SAP -> SSIS/ADF -> Azure SQL DB -> Azure Analysis Services (as semantic layer) tabular model with DAX -> Power BI
Volume is huge ( > 10TB ):
SAP -> SSIS/ADF -> Azure SQL DW -> Azure Analysis Services semantic layer -> Power BI
Of course, volume is just one of many factors to consider when deciding on an architecture, but it is an important one: in plenty of real-world cases MPP is not really necessary. The actual architecture and sizing require a lot more research; the points above are very general, just something to start with and explore further.
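To make point 2 of the proposed design concrete: "all the transformation work is done directly in Azure DW" is an ELT pattern, where SSIS/ADF only lands raw SAP extracts in staging tables and the cleansing/star-schema build runs as SQL inside the database. A minimal sketch with pyodbc; the server, schema, table, and column names are hypothetical:

```python
# ELT sketch: run a staging-to-star-schema transform inside Azure SQL DB/DW.
# Server, database, schema, table, and column names are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=dw;UID=etl_user;PWD=secret"
)
cur = conn.cursor()

# Cleanse staged SAP rows and load the fact table; the heavy lifting is
# done by the database engine, not the client.
cur.execute("""
    INSERT INTO dbo.FactSales (DateKey, CustomerKey, MaterialKey, Quantity, Amount)
    SELECT d.DateKey, c.CustomerKey, m.MaterialKey, s.Quantity, s.Amount
    FROM   stg.SalesDocuments s
    JOIN   dbo.DimDate d     ON d.[Date] = s.DocumentDate
    JOIN   dbo.DimCustomer c ON c.CustomerNumber = s.CustomerNumber
    JOIN   dbo.DimMaterial m ON m.MaterialNumber = s.MaterialNumber
    WHERE  s.Quantity IS NOT NULL
""")
conn.commit()
```

In practice you would orchestrate statements like this from ADF or SSIS rather than a standalone script (and on Azure SQL DW, CTAS-based loads are the usual pattern), but the division of labour is the point: the database engine does the joins, not the client.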
If you are looking for more technical detail on bringing SAP data to Azure, you can review our blog posts here: http://www.aecorsoft.com/blog/2018/2/18/extract-sap-data-to-azure-data-lake-for-scale-out-analytics-in-the-cloud and here: http://www.aecorsoft.com/blog/2018/4/26/use-azure-data-factory-to-bring-sap-data-to-azure
As far as I know, OLAP is used in Power Pivot to speed up interacting with data.
But big-data databases like Google BigQuery and Amazon Redshift have appeared in the last few years. Do SQL-targeted BI solutions like Looker and Chart.io use OLAP, or do they rely on the speed of the databases?
Looker relies on the speed of the database but does model the data to help with speed. Mode and Periscope are similar to this. Not sure about Chartio.
OLAP was used to organize data to help with query speeds. While it is used by many BI products like Power Pivot and Pentaho, several companies have built their own ways of organizing data to improve query speed, sometimes storing the data in their own data structures. Many cloud BI companies like Birst, Domo and GoodData do this.
Looker created a modeling language called LookML to model data stored in a data store. As databases are now much faster than they were when OLAP was created, Looker took the approach of connecting directly to the data store (Redshift, BigQuery, Snowflake, MySQL, etc.) to query the data. The LookML model lets the user interact with the data and then run the query to get results in a table or visualization.
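As a toy illustration of the "organize the data for query speed" idea that OLAP engines and the in-house structures mentioned above are built around: pre-compute a rollup along the dimensions users slice by, then answer interactive queries from the small aggregate instead of the raw rows. A sketch with made-up data and pandas:

```python
# Toy illustration of OLAP-style pre-aggregation: build a rollup once,
# answer interactive queries from it instead of scanning raw rows.
import pandas as pd

# Pretend this is the raw fact data (in a real system: millions of rows).
raw = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "US"],
    "month":   ["2023-01", "2023-02", "2023-01", "2023-01", "2023-02"],
    "revenue": [100.0, 120.0, 90.0, 60.0, 200.0],
})

# Pre-compute the rollup along the dimensions users slice by.
rollup = raw.groupby(["region", "month"], as_index=False)["revenue"].sum()

# A dashboard query ("revenue by month for US") now reads the small
# aggregate rather than the full fact table.
print(rollup[rollup["region"] == "US"])
```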
That depends. I have some experience with BI solutions (for example, we worked with Tableau), and they can operate in two main modes: they can execute the query against your server, or they can collect the relevant data and store it on the user's machine (or on the server where the app is installed). When working with large volumes, we used to make Tableau query the SQL Server itself, because our SQL Server machine was very strong compared to the other machines we had.
Either way, even if you store the data locally and want to "refresh" it, the tool needs to retrieve the updated data from the database, which can also be an expensive operation (depending on how your data is built and organized).
You should also note that you are comparing two different families of products: while Google BigQuery and Amazon Redshift are actual database engines used to store the data and also query it, most BI and reporting solutions are more concerned with querying the data and visualizing it, and are therefore (generally speaking) less focused on having smart internal databases (at least in my experience).
We have shifted from IBM DB2 databases to PostgreSQL databases on the AWS cloud. Is anyone aware of, or has anyone worked with, tools on AWS to test databases?
a) If so, what tools do you use?
b) What do you test when checking the databases in a Business Intelligence (BI) type of environment?
Anything other than just load or performance testing. I want to do functional testing, where I validate/verify that the data on the cloud servers and databases is equivalent to the data on the physical servers with DB2 as the database.
So, mainly a kind of data reconciliation, but with ETL also involved.
Our product Ajilius (http://ajilius.com) does 90% of what you're after. We specialise in cloud data warehouse automation. PostgreSQL is our primary DBMS for on-premise and SMP data warehouses; Redshift is one of our cloud platforms (as well as Snowflake and Azure SQL Data Warehouse); and DB2 is a supported data source.
I say "90%" because our data warehouse migration feature reconciles data that is migrated between warehouses, but only when both warehouses were created by Ajilius. I'd like to understand more about your need, if you email me through our web site we can talk it over in detail.
Two competitors - Matillion and Treasure Data - also work in this space. Matillion is a full ETL tool, Treasure Data is more "EL" without the T. Definitely look at them, they're both good products with different approaches.
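If you also want a do-it-yourself check alongside a tool, a basic reconciliation script that compares row counts and a column total per table between the DB2 source and the PostgreSQL target is a reasonable starting point. A minimal sketch; the drivers, connection strings, and table list below are hypothetical:

```python
# Sketch of a simple source-vs-target reconciliation: compare row counts
# and a numeric column sum per table between DB2 and PostgreSQL.
# Connection strings, driver names, and the table list are hypothetical.
import pyodbc      # DB2 source (via IBM's ODBC driver)
import psycopg2    # PostgreSQL target on AWS (e.g. RDS)

TABLES = [("SALES.ORDERS", "public.orders", "order_amount")]

db2 = pyodbc.connect(
    "DRIVER={IBM DB2 ODBC DRIVER};DATABASE=PRODDB;"
    "HOSTNAME=db2-host;PORT=50000;UID=tester;PWD=secret;"
)
pg = psycopg2.connect("dbname=proddb user=tester password=secret host=pg-host")

def profile(cursor, table, amount_col):
    """Return (row_count, summed_amount) for one table."""
    cursor.execute(f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {table}")
    return cursor.fetchone()

for db2_table, pg_table, amount_col in TABLES:
    src = profile(db2.cursor(), db2_table, amount_col)
    tgt = profile(pg.cursor(), pg_table, amount_col)
    status = "OK" if tuple(src) == tuple(tgt) else "MISMATCH"
    print(f"{db2_table} -> {pg_table}: source={tuple(src)} target={tuple(tgt)} [{status}]")
```

In practice you would extend this with per-partition checks or key/column hashes and run it from whatever orchestrates your ETL, but counts plus totals catch most load defects cheaply.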