Tableau visualization - Performance issue with huge data - amazon-s3

I have huge data from different DB sources ( Oracle, Mongo, Cassandra ) and also eventing data available in Kafka. Using Tableau for analytics and facing performance issue with huge data. So, planning to store data in some other way and use Tableau for visualization also. Have multiple options now and need some help to finalize the approach.
Option 1:-
Read DB data and store them in Parquet file and then expose it over Spark SQL or HiveQL or Presto SQL and let Tableau connect to this SQL.
Option 2:-
Read DB data and store them in Parquet file in S3 and then use AWS Athena for analytics and let Tableau connect to Athena.
Option 3:-
Read DB data and store them in Parquet file in S3 and then move to Redshift for analytics and let Tableau connect to Redshift.
Not sure if any of the above approach will be a good solution for streaming data( Kafka ) analytics as well.
Note:- I have multiple big tables and need joins b/w them.

I understand you have huge data from different sources, and you also have access to AWS. Then, you plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle of creating tables over flat files via a metastore which stores table definition. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (TB data). The main difference is the pricing model (Athena is based on price per data processed per request and is serverless, whereas Spark may imply infrastructure cost).
Then, both options may not perform well as they are not OLAP systems designed for self service BI (they are better use for ad hoc queries over huge data regarding).
Then, you may have trouble in managing your data model using flat files and table or views over them (data storage and compression won't be optimized for each table which may impact Tableau performance).
Option 3
Option 3 is better as it is based on Redshift which is designed to support OLAP system. You can connect Tableau directly to Redshift but you'll suffer from latency and you may have trouble managing your cluster load depending on the number of users and/or requests. But it can work the way you describe it.
Then, if you have performance issues, you'll be able to create data source extracts from Redshift to Tableau later on. You can also implement an intermediate database to store pre-aggregated queries (= datamarts) and connect Tableau directly to it which will avoid performing the same query on Redshift each time a dashboard is opened in Tableau (in that case Redshift also caches queries).
Then, as you need to perform multiple joins, you'll be able to optimize data storage for such queries using Redshift by setting the right partition and sort keys.
To conclude, you can also directly access flat files from Redshift using Redshift Spectrum (via Athena/Glue metastore).
Documentations:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/

Related

Load daily MySQL DB snapshots from S3 to snowflake

I have daily MySQL DB snapshots stored on S3. This daily DB snapshot is a backup of 1000 tables in our DB, using mysqldump, size is about 300M daily (stored 1 year of snapshots, which is about 110G).
Now we want to load these snapshots daily to snowflake for reporting purpose. How do we create tables in snowflake? Shall we create 1000 tables? Will snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable, reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce you metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics or which will be accessed by BI tools (including direct SQL) I would hold in Snowflake - otherwise you wont get the performance that Snowflake can deliver and there is little point using Snowflake - you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop the data from S3 (or archive it or do whatever you want to it).
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc.
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.)
ETL/ELT tool - probably overkill for your specific use case but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as move your data around
Hope this helps?

De-Identifying PHI For HIPAA

I have a SQL DB which contains PHI, hosted on AWS. I want to access this data to perform analytics, however, I must de-identify the data first to comply with HIPAA.
How should I approach this? I have thought of a few approaches:
Simply de-identify the DB with SQL commands.
From now on, every time the DB is added to, add a de-identified version of that data to another DB. Then access this DB for analytics.
From now on, every time the DB is added to, add a de-identified version of that data to another table in that DB. Then access this table with SQL commands for analytics.
Which is the best approach to use to maintain compliance with HIPAA? Or, is there a better way?
Thanks!
Budget allowing, consider doing your analytics on a different system and during the ETL, de-identify the data. Changing the source system to accommodate this requirement will increase complexity to maintain and likely affect other integrations - might end up with a monolith.
There's various ways to do this: You could do a AWS DMS (with ongoing replication) with the DB as your source and S3 as target (parquet format). From there you could use Athena for analytics as jarmod highlighted, which also supports parquet format and you can even use SQL-like queries in Athena to analyze your data. There's also Redshift, send to another Relational DB, other analytics platforms etc.

How to minimize cost per SQL query execution in BigQuery

I am new to BigQuery and GCP. I am working with a (big) public data set available in BigQuery on which I am running a SQL query - it selects a bunch of data from one of the tables in the dataset, based on a simple where clause.
I then proceed to perform additional operations on the obtained data. I only need to run this query once a month, the other operations need to be run more often (hourly).
My problem is that every time I do this, it causes BigQuery to process 4+ million rows of data, and the cost of running this query is quickly adding up for me.
Is there a way I can run the SQL query and export the data to another
table/database in GCP, and then run my operations on that exported
data?
Am I correct in assuming (and I could be wrong here) that once I
export data to standard SQL DB in GCP, the cost per query will be
less in that exported database than it is in BigQuery?
Thanks!
Is there a way I can run the SQL query and export the data to another table/database in GCP, and then run my operations on that exported data?
You can run your SQL queries and therefore export the data into another table/databases in GCP by using the Client Libraries for BigQuery. You can also refer to this documentation about how to export table data using BigQuery.
As for the most efficient way to do it, I will proceed by using both BigQuery and Cloud SQL (for the other table/database) APIs.
The BigQuery documentation has an API example for extracting a BigQuery table to your Cloud Storage Bucket.
Once the data is in Cloud Storage, you can use the Cloud SQL Admin API to import the data into your desired database/table. I attached documentation regarding the best practices on how to import/export data within Cloud SQL.
Once the data is exported you can delete the residual files from your Cloud Storage Bucket, using the console, or interacting with the Cloud Storage. API
Am I correct in assuming (and I could be wrong here) that once I export data to standard SQL DB in GCP, the cost per query will be less in that exported database than it is in BigQuery?
As for the prices, you will find here how to estimate storage and query costs within BigQuery. As for other databases like Cloud SQL, here you will find more information about the Cloud SQL pricing.
Nonetheless, as Maxim point out, you can refer to both the best practices within BigQuery in order to maximize efficiency and therefore minimizing cost, and also the best practices for using Cloud SQL.
Both can greatly help you minimize cost and be more efficient in your queries or imports.
I hope this helps.

What is the generic term for technologies like AWS Athena (Presto) and GCP BigQuery?

From a user perspective, Athena and BigQuery both accept a sql-like query, they both query files stored on disk (without needing to have set up a relational database), and they both return results (usually very quickly). Do such technologies have a name? i.e. is there a generic term for technologies like AWS Athena and GCP BigQuery?
They are both distributed SQL Query Engines for big [in-place] data. Athena is based on Presto, which declares itself to be a Distributed SQL Query Engine for Big Data.
Apache Drill was based on the original BigQuery design and defines itself as a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
The three things that define them are the possibility of running SQL, their distributed nature so they can operate at scale for interactive queries, and the power to query data without having to ingest it first.
Note in the case of BigQuery, initially the data would need to be ingested and it is still the preferred way of working, even if querying data directly from GCS has been available for a number of years. Athena only works with external tables.
Google BigQuery is a serverless data warehouse that supports super-fast SQL queries for analyze data in parallel. Amazon Athena is a serverless interactive query service that allows you to conveniently analyze data stored in Amazon Simple Storage Service (S3) by using basic SQL in parallel.
Both technologies could be considered as MPP (massively parallel processing) systems as both technologies process data analytics in parallel.

Tableau data extract refresh from Google BigQuery takes very long

We are very pleased with the combination BigQuery <-> Tableau Server with live connection. However, we now want to work with a data extract (500MB) on Tableau Server (since this datasource is not too big and is used very frequently). This takes too much time to refresh (1.5h+). We noticed that only 0.1% is query time and the rest is data export. Since the Tableau Server is on the same platform and location, latency should not be a problem.
This is similar to the slow export of a BigQuery table to a single file, which can be solved by using "daisy chain" option (wildcards). Unfortunately we can't use similar logic with a Google BigQuery data extract refresh in Tableau...
We have identified some approaches, but are not pleased with our current ideas:
Working with incremental refresh: our existing BigQuery table rows can change: these changes can only be applied in Tableau if you do a full refresh
Exporting the BigQuery table to GCS using the daisy chain option and making a Tableau data extract using the Tableau SDK: this would result in quite some overhead...
Writing a Dataflow job using a custom sink for Tableau Server (data extracts).
Experimenting with a Tableau web connector that communicates directly with the BigQuery API: I don't think this will be faster? I didn't see anything about parallelizing calls with the Tableau web connecector, but I didn't try this approach yet.
We would prefer a non-technical option, to limit maintenance... Is there a way to modify the Tableau connector to make use of the "daisy chain" option for BigQuery?
You've uploaded the data in BigQuery. Can't you just use the input for that load job (a CSV perhaps) as input for Tableau?
When we use Tableau and BigQuery we also notice that extracts are slow but we generally don't do that because you lose BigQuery's power. We start with a live data connection at first, and then (if needed) convert this into a custom query that aggregates that data into a much smaller datasets which extracts in just a few seconds.
Another way to achieve higher performance with BigQuery and Tableau is aggregating or joining tables on beforehand. JOINs on huge tables can be slow, so if you use a lot of those you might consider generating a denormalised dataset which does all of the JOIN-ing first. You will get a dataset with a lot of duplicates and a lot of columns. But if you select only what you need in Tableau (hide unused fields!) then these columns won't count in your query cost.
One recommendation I have seen is similar to your point 2 where you export the BQ table to Google Cloud Storage and then use the Tableau Extract API to create a .tde from the flat files in GCS.
This was from an article on the Google Cloud site so I'd assume it would be best practice:
https://cloud.google.com/blog/products/gcp/the-switch-to-self-service-marketing-analytics-at-zulily-best-practices-for-using-tableau-with-bigquery
There is an article here which provides a step by step guide to achieving the above.
https://community.tableau.com/docs/DOC-23161
It would be nice if Tableau optimised the BQ connector for extract refresh using the BigQuery Storage API. We too have our Tableau Server environment in the same GCP zone as our BQ datasets and experience slow refresh times.