As far as I know, OLAP is used in Power Pivot to speed up interacting with data.
But big data databases like Google BigQuery and Amazon Redshift have appeared in the last few years. Do SQL-targeted BI solutions like Looker and Chart.io use OLAP, or do they rely on the speed of the databases?
Looker relies on the speed of the database but does model the data to help with speed. Mode and Periscope are similar to this. Not sure about Chartio.
OLAP was used to organize data to help with query speed. While it is used by many BI products like Power Pivot and Pentaho, several companies have built their own ways of organizing data for fast queries, sometimes storing it in their own data structures. Many cloud BI companies like Birst, Domo and Gooddata do this.
Looker created a modeling language called LookML to model data stored in a data store. As databases are now faster than they were when OLAP was created, Looker took the approach of connecting directly to the data store (Redshift, BigQuery, Snowflake, MySQL, etc) to query the data. The LookML model allows the user to interface with the data and then run the query to get results in a table or visualization.
That depends. I have some experience with BI solutions (for example, we worked with Tableau), and such a tool can operate in two main modes: it can execute the query against your server, or it can collect the relevant data and store it on the user's machine (or on the server where the app is installed). When working with large volumes, we used to make Tableau query the SQL Server itself, because our SQL Server machine was very strong compared to the other machines we had.
In any case, even if you store the data locally and want to "refresh" it, the refresh still has to retrieve the data from the database, which can also be an expensive operation (depending on how your data is built and organized).
You should also note that you are comparing two different families of products: Google BigQuery and Amazon Redshift are database engines that are used to store the data and also query it, while most BI and reporting solutions are more concerned with querying the data and visualizing it, and therefore (generally speaking) are less focused on having smart internal databases (at least in my experience).
Related
I have daily MySQL DB snapshots stored on S3. Each daily snapshot is a mysqldump backup of the 1000 tables in our DB, about 300 MB per day (we store one year of snapshots, which is about 110 GB).
Now we want to load these snapshots into Snowflake daily for reporting purposes. How do we create tables in Snowflake? Shall we create 1000 tables? Will Snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable, reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce your metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics, or which will be accessed by BI tools (including direct SQL), I would hold in Snowflake - otherwise you won't get the performance that Snowflake can deliver and there is little point in using Snowflake; you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop it from S3 (or archive it, or do whatever you want with it).
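For illustration, here is a minimal sketch of what that staging flow might look like in Snowflake SQL. All the names (stage, file format, bucket path, target table) are placeholders, and it assumes the per-table mysqldump output has first been exported or converted to CSV, since COPY INTO loads delimited or semi-structured files rather than raw SQL dumps:

-- Placeholder names throughout; assumes the dumps have been converted to CSV first.
CREATE OR REPLACE STAGE mysql_snapshots
  URL = 's3://my-backup-bucket/daily-snapshots/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

CREATE OR REPLACE FILE FORMAT csv_snapshot
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1;

-- Load one source table's daily extract from the stage into a Snowflake table.
COPY INTO orders
  FROM @mysql_snapshots/2021-06-01/orders.csv
  FILE_FORMAT = (FORMAT_NAME = csv_snapshot);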
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
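As a sketch of the External Table route (again with placeholder names, reusing the hypothetical stage and file format from above): the external table just points at the files in S3 and exposes each row through a VALUE column that you cast as needed.

-- Placeholder names; the stage and file format are the hypothetical ones above.
CREATE OR REPLACE EXTERNAL TABLE orders_ext
  WITH LOCATION = @mysql_snapshots/orders/
  FILE_FORMAT = (FORMAT_NAME = csv_snapshot)
  AUTO_REFRESH = FALSE;

-- For CSV files each row arrives as VALUE with fields c1, c2, ...; cast what you need.
SELECT value:c1::NUMBER AS order_id,
       value:c2::DATE   AS order_date
FROM orders_ext;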
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc.
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.)
ETL/ELT tool - probably overkill for your specific use case but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as move your data around
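Whichever of these you pick, the metric calculation itself usually boils down to SQL run inside Snowflake. A minimal sketch, with hypothetical table, column and warehouse names, using a plain aggregate plus an optional scheduled task for the refresh:

-- Hypothetical names throughout: a simple daily metric rebuilt from the loaded data.
CREATE OR REPLACE TABLE daily_order_metrics AS
SELECT order_date,
       COUNT(*)         AS order_count,
       SUM(order_total) AS revenue
FROM orders
GROUP BY order_date;

-- Optionally schedule the refresh with a Snowflake task.
CREATE OR REPLACE TASK refresh_daily_order_metrics
  WAREHOUSE = reporting_wh
  SCHEDULE  = 'USING CRON 0 6 * * * UTC'
AS
  INSERT OVERWRITE INTO daily_order_metrics
  SELECT order_date, COUNT(*), SUM(order_total)
  FROM orders
  GROUP BY order_date;

ALTER TASK refresh_daily_order_metrics RESUME;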
Hope this helps?
I'm confused and having trouble finding examples and reference architectures where someone wants to extract data from an existing data lake (S3/Lake Formation in my case) and build an OLTP datastore that serves as an application's backend. Everything I come across is an OLAP data warehousing pattern (i.e. ETL -> S3 -> Redshift -> BI tools) where data is always coming IN to the data lake and warehouse rather than being pulled OUT. I don't necessarily need 'business analytics', but I do need to display graphs in web dashboards, with large amounts of time series data points underneath, for my website's users.
What if I want to automate pulling extracts of a large dataset in the data lake and build a relational database with some useful extracts from the various datasets, which need to be queried by the handful instead of via large analytical queries against a DW?
What if I just want an extract of, say, stock prices over 10 years, and just need the list of unique ticker symbols to populate a drop-down on a web app? I don't want to query an OLAP data warehouse every time to get this, so I want my own OLTP store for more performant queries on smaller datasets that will see much higher TPS.
What if I want to build dashboards for my web app's customers that display graphs of large amounts of time series data currently sitting in the data lake/warehouse? Does my web app connect directly to the DW to display this data? Or do I pull that data out of the data lake or warehouse and into my application DB on some schedule?
My views on your 3 questions:
Why not just use the same ETL solution that is being used to load the datalake?
Presumably your DW has a Ticker dimension with a unique record for each ticker symbol? What's the issue with querying that, as getting the unique ticker symbols from it would be very fast (see the query sketch after these points)?
It depends entirely on your environment/infrastructure and what you are doing with the data - so there is no generic answer anyone could provide you with. If your webapp is showing aggregations of a large volume of data then your DW is probably better at doing those aggregations and passing the aggregated data to your webapp; if the webapp is showing unaggregated data (and only a small subset of what is held in your DW, such as just the last week's data) then loading it into your application DB might make more sense
The pros/cons of any solution would also be heavily influenced by your infrastructure e.g. what's the network performance like between your DW and application?
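On point 2, the dimension query I have in mind is no more than the following (table and column names are hypothetical); a DW will answer it almost instantly, and you could cache the result in your app if you want to avoid even that round trip:

-- Hypothetical dimension table; a small, fast scan of the unique ticker symbols.
SELECT DISTINCT ticker_symbol
FROM dim_ticker
ORDER BY ticker_symbol;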
I have several OLTP databases with APIs talking to them. I also have ETL jobs pushing data to an OLAP database every few hours.
I've been tasked with building a custom dashboard showing high-level data from the OLAP database. I want to build several APIs pointing to the OLAP database. Should I:
Add to my existing APIs, call the OLAP database, and use a CQRS-type pattern, so reads come from OLAP while writes go to OLTP. My concern here is that there could be a mismatch in the data between reads and writes. How mismatched the data is depends on how often you run the ETL jobs (hours in my case).
Add to my existing APIs, call the OLAP database, and ask the client to choose whether they want OLAP or OLTP data where the APIs overlap. My concern here is that the client should not need to know about the implementation detail of where the data is coming from.
Write new APIs that only point to the OLAP database. This is a lot of extra work.
Don't use #1: when management ask for analytical reports, the data mismatch caused by the ETL process doesn't bother them - obviously you will generate the CEO report after the day's ETL has finished.
Don't use #2: that way you'll load the transactional system with analytical overhead and dissolve the isolation between the purposes of the two systems (not good for operation and maintenance).
Use #3, as it's the best way to fetch processed results. Use modern tools like Excel, Power Query and Power BI, which let you create rich dashboards quickly instead of digging into tables and writing APIs.
I am looking for a solution that will host a nearly-static 200GB, structured, clean dataset, and provide a JSON API onto the data, for querying in a web app.
Each row of my data looks like this, and I have about 700 million rows:
parent_org,org,spend,count,product_code,product_name,date
A31,A81001,1003223.2,14,QX0081,Rosiflora,2014-01-01
The data is almost completely static - it updates once a month. I would like to support straightforward aggregate queries like:
get total spending on product codes starting QX, by organisation, by month
get total spending by parent org A31, by month
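In SQL terms (Postgres-flavoured, with the table name a placeholder and column names taken from the sample row above), those two queries might look roughly like this:

-- "count" and "date" are quoted because they clash with SQL keywords.
CREATE TABLE spending (
    parent_org   TEXT,
    org          TEXT,
    spend        NUMERIC,
    "count"      INTEGER,
    product_code TEXT,
    product_name TEXT,
    "date"       DATE
);

-- Total spending on product codes starting QX, by organisation, by month.
SELECT org,
       date_trunc('month', "date") AS month,
       SUM(spend)                  AS total_spend
FROM spending
WHERE product_code LIKE 'QX%'
GROUP BY org, month
ORDER BY org, month;

-- Total spending by parent org A31, by month.
SELECT date_trunc('month', "date") AS month,
       SUM(spend)                  AS total_spend
FROM spending
WHERE parent_org = 'A31'
GROUP BY month
ORDER BY month;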
And I would like these queries to be available over a RESTful JSON API, so that I can use the data in a web application.
I don't need to do joins, I only have one table.
Solutions I have investigated:
To date I have been using Postgres (with a web app to provide the API), but am starting to reach the limits of what I can do with indexing and materialized views, without dedicated hardware + more skills than I have
Google Cloud Datastore: is suitable for structured data of about this size, and has a baked-in JSON API, but doesn't do aggregates (so I couldn't support my "total spending" queries above)
Google Bigtable: can definitely handle data of this size and can do aggregates; I could build my own API using App Engine. Might need to convert the data to HBase format to import it.
Google BigQuery: fast at aggregating, would need to roll my own API as with BigTable, easy to import data
I'm wondering if there's a generic solution for my needs above. If not, I'd also be grateful for any advice on the best setup for hosting this data and providing a JSON API.
Update: Seems that BigQuery and Cloud SQL support SQL-like queries, but Cloud SQL may not be big enough (see comments) and BigQuery gets expensive very quickly, because you're paying by the query, so isn't ideal for a public web app. Datastore is good value, but doesn't do aggregates, so I'd have to pre-aggregate and have multiple tables.
Cloud SQL is likely sufficient for your needs. It certainly is capable of handling 200GB, especially if you use Cloud SQL Second Generation.
The only reason why a conventional database like MySQL (the database Cloud SQL uses) might not be sufficient is if your queries are very complex and not indexed. I recommend you try Cloud SQL, and if the performance isn't sufficient, try ensuring you have sufficient indexes (hint: use the EXPLAIN statement to see how the queries are being executed).
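As a sketch of that (Cloud SQL runs MySQL, so MySQL syntax; the table is the one from the question and the index name is made up):

-- See how MySQL plans one of the aggregate queries.
EXPLAIN
SELECT parent_org,
       DATE_FORMAT(`date`, '%Y-%m') AS month,
       SUM(spend) AS total_spend
FROM spending
WHERE parent_org = 'A31'
GROUP BY parent_org, month;

-- If the plan shows a full table scan, an index on the filter/group columns usually helps.
CREATE INDEX idx_spending_parent_org_date ON spending (parent_org, `date`);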
If your queries cannot be indexed in a useful way, or they are so CPU-intensive that they are slow regardless of indexing, you might want to graduate up to BigQuery. BigQuery is parallelised so it can handle pretty much as much data as you throw at it; however, it isn't optimized for real-time use and isn't as convenient as Cloud SQL's "MySQL in a box".
Take a look at ElasticSearch. It's JSON, REST, cloud, distributed, quick on aggregate queries and so on. It may or may not be what you're looking for.
The company I am working for is implementing SharePoint with reporting servers that run on a SQL back end. The information we need lives on two different servers. The first is the manufacturing server, which collects data from PLCs and inputs that information into a SQL database; the other is our ERP server, which has data for payroll and hours worked on specific projects. The idea I have is to create a view on a separate database and then pull the information from both servers from there. I am having a little bit of trouble with the syntax for connecting the two servers to run the view. We are running MS SQL Server. If you need any more information or clarification please let me know.
Please read this about Linked Servers.
Alternatively you can build a Data Warehouse - which would be a reporting database. You can feed it either by writing procs that use linked servers, or by using SSIS packages if the servers aren't linked.
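As a rough sketch of the linked-server approach (all server, database, table and column names here are made up): each remote server gets a linked server definition, and a view on your reporting database can then combine them with four-part names.

-- Made-up names throughout; repeat sp_addlinkedserver for each remote server.
EXEC sp_addlinkedserver
     @server     = N'ERPSERVER',
     @srvproduct = N'',
     @provider   = N'SQLNCLI',
     @datasrc    = N'erp-host.example.local';

-- Four-part names let one view pull from both sources.
CREATE VIEW dbo.vw_ProjectHours AS
SELECT p.ProjectId,
       p.UnitsProduced,
       h.HoursWorked,
       h.PayrollCost
FROM   MFGSERVER.ManufacturingDb.dbo.ProductionLog AS p
JOIN   ERPSERVER.ErpDb.dbo.ProjectHours            AS h
       ON h.ProjectId = p.ProjectId;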
It all depends on the project's size and complexity, but in many cases it is difficult to aggregate data from multiple sources with views. The reason is that the source data structure is modeled for the source application and not optimized for reporting.
In that case, I would suggest going with an ETL process, where you would create a set of Extract, Transform and Load jobs to get data from multiple sources (databases) into a target database where data will be stored in the format optimized for reporting.
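As a minimal, hypothetical illustration of the "transform and load" step (all schema, table and column names invented): source extracts land in a staging schema, and a load step reshapes them into a fact table keyed on reporting dimensions.

-- Invented names; the point is that the target is modeled for reporting, not a mirror of the sources.
INSERT INTO reporting.fact_project_hours
    (project_key, work_date, hours_worked, units_produced)
SELECT d.project_key,
       s.work_date,
       s.hours_worked,
       p.units_produced
FROM   staging.erp_hours      AS s
JOIN   staging.mfg_production AS p
       ON  p.project_id = s.project_id
       AND p.work_date  = s.work_date
JOIN   reporting.dim_project  AS d
       ON  d.project_id = s.project_id;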
Ralph Kimball has many great books on the subject, for example:
1) The Data Warehouse ETL Toolkit
2) The Data Warehouse Toolkit
They are truly worth reading if you are dealing with data.