Archival solution for BigQuery tables in a data warehouse - google-bigquery

Given a use case of building a data warehouse using BigQuery, suppose a monthly backup needs to be taken of all the BigQuery tables. What would be the best option?
Export all the table data to Cloud Storage (CSV?)
Copy all the tables to a different dataset (and maybe in a different project)
Which would be the best option, considering cost and maintenance? Please also share any other options.

When moving data from BigQuery to GCS, you are not charged for the export and load operations, as mentioned in the Free operations documentation; however, you incur charges for storing data in GCS, which depend on the storage class selected. This service offers Multi-Regional, Regional, Nearline and Coldline options that you can choose based on how frequently you need to access the stored data.
Based on this, if you want to make your backups and don't access the data frequently, you can store your data in GCS with Coldline/Nearline storage, or rely on long-term storage in BigQuery, which is applied automatically when a table is not edited for 90 consecutive days; those are the cheaper options. On the other hand, if you plan to use your data actively, it may be better to keep it in BigQuery active storage, which costs roughly the same as GCS Regional storage; ultimately it depends on your specific use cases and how you want to interact with your data.
Regarding the ingestion file format, BigQuery supports a variety of formats that you can use to load your data. I suggest you check this documentation, which can help you decide which format best fits your scenario based on your data structure.
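If you go the export route, a minimal sketch using the google-cloud-bigquery Python client could look like the following; the project, dataset, table and bucket names are placeholders, and Avro is chosen here only because it preserves the table schema better than CSV:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Option 1: export the table to Cloud Storage (e.g. a Nearline/Coldline bucket for archives).
source_table = "my-project.my_dataset.my_table"                              # placeholder table
destination_uri = "gs://my-archive-bucket/backups/2020-01/my_table-*.avro"   # placeholder bucket

extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO
)
client.extract_table(source_table, destination_uri, job_config=extract_config).result()

# Option 2: copy the table into a dedicated archive dataset (possibly in another project).
client.copy_table(source_table, "my-archive-project.archive_dataset.my_table_202001").result()
```

Either job can be scheduled monthly (for example with Cloud Scheduler or cron); if you keep the copies in BigQuery and never edit them, long-term storage pricing applies after 90 days.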

Related

Load daily MySQL DB snapshots from S3 to Snowflake

I have daily MySQL DB snapshots stored on S3. Each daily snapshot is a backup of 1000 tables in our DB, taken with mysqldump; the size is about 300 MB per day (we keep 1 year of snapshots, which is about 110 GB).
Now we want to load these snapshots into Snowflake daily for reporting purposes. How do we create tables in Snowflake? Shall we create 1000 tables? Will Snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable, reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce your metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics, or which will be accessed by BI tools (including direct SQL), I would hold in Snowflake - otherwise you won't get the performance that Snowflake can deliver and there is little point in using Snowflake - you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop it from S3 (or archive it, or do whatever you want with it).
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
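To make these two routes concrete, here is a rough sketch using the snowflake-connector-python package; the connection parameters, stage, bucket and table names are all invented, and it assumes the mysqldump output has first been converted to Parquet (or CSV) files in S3, since raw SQL dump files cannot be queried directly:

```python
import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="load_wh", database="reporting_db", schema="staging",
)
cur = conn.cursor()

# Point a stage at the S3 location holding the converted snapshot files.
cur.execute("""
    CREATE OR REPLACE STAGE snapshot_stage
      URL = 's3://my-snapshot-bucket/daily/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
""")

# Option A: load into a regular Snowflake table, treating S3 purely as a staging area.
cur.execute("""
    COPY INTO customers
    FROM @snapshot_stage/customers/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

# Option B: leave the data in S3 and query it through an external table instead
# (rows are exposed via the VALUE variant column unless you define column expressions).
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE customers_ext
      LOCATION = @snapshot_stage/customers/
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = FALSE
""")
```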
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc.
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.) - see the sketch after this list
ETL/ELT tool - probably overkill for your specific use case but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as move your data around
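As an illustration of the "external coding tools" option, a minimal sketch (again with invented table, metric and connection names) that computes a metric and persists it back into Snowflake for the BI layer to read:

```python
import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="metrics_wh", database="reporting_db", schema="analytics",
)
cur = conn.cursor()

# Compute a metric from the loaded snapshot tables and persist it for BI tools to read.
cur.execute("""
    CREATE OR REPLACE TABLE daily_order_totals AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Or pull the results back into Python for further processing.
cur.execute("SELECT order_date, total_amount FROM daily_order_totals ORDER BY order_date")
for order_date, total_amount in cur.fetchall():
    print(order_date, total_amount)
```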
Hope this helps?

Tableau visualization - Performance issue with huge data

I have huge data from different DB sources (Oracle, Mongo, Cassandra) and also event data available in Kafka. We use Tableau for analytics and are facing performance issues with this much data. So, we are planning to store the data in some other way and keep using Tableau for visualization. We have multiple options now and need some help to finalize the approach.
Option 1:-
Read the DB data, store it in Parquet files, then expose it over Spark SQL, HiveQL or Presto SQL, and let Tableau connect to that SQL layer.
Option 2:-
Read the DB data, store it in Parquet files in S3, then use AWS Athena for analytics, and let Tableau connect to Athena.
Option 3:-
Read the DB data, store it in Parquet files in S3, then move it to Redshift for analytics, and let Tableau connect to Redshift.
Not sure if any of the above approaches will be a good solution for streaming data (Kafka) analytics as well.
Note: I have multiple big tables and need joins between them.
I understand you have huge data from different sources, and you also have access to AWS. You plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle of creating tables over flat files via a metastore that stores the table definitions. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (TB-scale). The main difference is the pricing model (Athena is priced per amount of data processed per request and is serverless, whereas Spark may imply infrastructure cost).
However, both options may not perform well, as they are not OLAP systems designed for self-service BI (they are better used for ad hoc queries over huge data).
You may also have trouble managing your data model using flat files and tables or views over them (data storage and compression won't be optimized for each table, which may impact Tableau performance).
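For a feel of what Option 2 looks like operationally, here is a rough boto3 sketch that runs an Athena query over Parquet files registered in the Glue Data Catalog; the region, database, table and bucket names are placeholders:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

# The 'events' table is assumed to be defined in the Glue Data Catalog over Parquet files in S3.
response = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "my_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes; you pay per amount of data scanned, not per cluster hour.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
print(state)
```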
Option 3
Option 3 is better, as it is based on Redshift, which is designed as an OLAP system. You can connect Tableau directly to Redshift, but you may suffer from latency and have trouble managing your cluster load depending on the number of users and/or requests. Still, it can work the way you describe it.
If you run into performance issues, you'll be able to create Tableau data source extracts from Redshift later on. You can also implement an intermediate database to store pre-aggregated queries (i.e. data marts) and connect Tableau directly to it, which avoids running the same query on Redshift each time a dashboard is opened in Tableau (Redshift also caches queries in that case).
As you need to perform multiple joins, you'll be able to optimize data storage for such queries in Redshift by setting the right distribution and sort keys.
Finally, you can also access flat files directly from Redshift using Redshift Spectrum (via the Athena/Glue metastore).
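To illustrate the distribution/sort key tuning and the Spectrum approach, here is a rough sketch using psycopg2 against Redshift; the cluster endpoint, table columns, Glue database and IAM role are all placeholders:

```python
import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
cur = conn.cursor()

# Distribute and sort the big tables on the columns used for joins and dashboard filters.
cur.execute("""
    CREATE TABLE orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  DATE,
        amount      DECIMAL(12, 2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (order_date)
""")

# Redshift Spectrum: expose the Parquet files in S3 through an external schema
# backed by the Glue Data Catalog, so they can be joined with local tables.
cur.execute("""
    CREATE EXTERNAL SCHEMA spectrum_events
    FROM DATA CATALOG DATABASE 'my_datalake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")
conn.commit()
```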
Documentations:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/

De-Identifying PHI For HIPAA

I have a SQL DB which contains PHI, hosted on AWS. I want to access this data to perform analytics, however, I must de-identify the data first to comply with HIPAA.
How should I approach this? I have thought of a few approaches:
Simply de-identify the DB with SQL commands.
From now on, every time the DB is added to, add a de-identified version of that data to another DB. Then access this DB for analytics.
From now on, every time the DB is added to, add a de-identified version of that data to another table in that DB. Then access this table with SQL commands for analytics.
Which is the best approach to use to maintain compliance with HIPAA? Or, is there a better way?
Thanks!
Budget allowing, consider doing your analytics on a different system and de-identifying the data during the ETL. Changing the source system to accommodate this requirement will add complexity to maintain and will likely affect other integrations - you might end up with a monolith.
There are various ways to do this: you could use AWS DMS (with ongoing replication) with the DB as your source and S3 as the target (Parquet format). From there you could use Athena for analytics, as jarmod highlighted, which also supports the Parquet format, and you can even use SQL-like queries in Athena to analyze your data. There's also Redshift, sending the data to another relational DB, other analytics platforms, etc.
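As a very rough sketch of the de-identification step in such an ETL (the column names, salting scheme and Parquet paths are all invented, and this is not a complete Safe Harbor implementation - HIPAA requires removing or generalizing all 18 identifier types, or an Expert Determination):

```python
import hashlib
import pandas as pd

# Hypothetical direct identifiers; the real list depends on your schema.
DIRECT_IDENTIFIERS = ["name", "email", "phone", "ssn", "address"]

def deidentify(df: pd.DataFrame, salt: str) -> pd.DataFrame:
    out = df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])
    # Replace the patient key with a salted hash so rows can still be joined
    # across tables without exposing the original identifier.
    out["patient_key"] = df["patient_id"].astype(str).map(
        lambda v: hashlib.sha256((salt + v).encode()).hexdigest()
    )
    out = out.drop(columns=["patient_id"])
    # Generalize dates of birth to the year only.
    if "date_of_birth" in out.columns:
        out["birth_year"] = pd.to_datetime(out["date_of_birth"]).dt.year
        out = out.drop(columns=["date_of_birth"])
    return out

# Example: read a Parquet extract produced by DMS, de-identify it, write it to the
# analytics bucket (reading s3:// paths with pandas requires the s3fs package).
df = pd.read_parquet("s3://my-dms-target/patients/part-0000.parquet")
deidentify(df, salt="...").to_parquet("s3://my-analytics-bucket/patients/part-0000.parquet")
```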

How to minimize cost per SQL query execution in BigQuery

I am new to BigQuery and GCP. I am working with a (big) public data set available in BigQuery on which I am running a SQL query - it selects a bunch of data from one of the tables in the dataset, based on a simple where clause.
I then proceed to perform additional operations on the obtained data. I only need to run this query once a month, the other operations need to be run more often (hourly).
My problem is that every time I do this, it causes BigQuery to process 4+ million rows of data, and the cost of running this query is quickly adding up for me.
Is there a way I can run the SQL query and export the data to another table/database in GCP, and then run my operations on that exported data?
Am I correct in assuming (and I could be wrong here) that once I export data to standard SQL DB in GCP, the cost per query will be less in that exported database than it is in BigQuery?
Thanks!
Is there a way I can run the SQL query and export the data to another table/database in GCP, and then run my operations on that exported data?
You can run your SQL queries and export the results into another table/database in GCP by using the Client Libraries for BigQuery. You can also refer to this documentation about how to export table data from BigQuery.
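For the first part - materializing the monthly query result so the frequent operations don't rescan the public table - a minimal sketch with the google-cloud-bigquery Python client (project, dataset, table and filter are placeholders) could be:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Write the monthly query result into a table you own; the hourly operations then
# read this much smaller table instead of re-scanning the public dataset.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.monthly_extract",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = """
    SELECT *
    FROM `bigquery-public-data.some_dataset.some_table`  -- placeholder public table
    WHERE some_column = 'some_value'                     -- placeholder filter
"""
client.query(sql, job_config=job_config).result()
```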
As for the most efficient way to do it, I would proceed by using both the BigQuery and Cloud SQL (for the other table/database) APIs.
The BigQuery documentation has an API example for extracting a BigQuery table to your Cloud Storage Bucket.
Once the data is in Cloud Storage, you can use the Cloud SQL Admin API to import the data into your desired database/table. I attached documentation regarding the best practices on how to import/export data within Cloud SQL.
Once the import has completed, you can delete the residual files from your Cloud Storage bucket using the console or the Cloud Storage API.
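Putting those steps together, a rough sketch (assuming the v1beta4 discovery client for the Cloud SQL Admin API; instance, database, bucket and table names are placeholders) might be:

```python
from google.cloud import bigquery, storage
from googleapiclient import discovery

PROJECT = "my-project"                   # placeholder project
BUCKET = "my-transfer-bucket"            # placeholder bucket
OBJECT = "exports/monthly_extract.csv"
GCS_URI = f"gs://{BUCKET}/{OBJECT}"

# 1. Export the materialized table from BigQuery to Cloud Storage as CSV.
bq = bigquery.Client(project=PROJECT)
bq.extract_table("my-project.my_dataset.monthly_extract", GCS_URI).result()

# 2. Ask the Cloud SQL Admin API to import the CSV into a Cloud SQL table
#    (the Cloud SQL service account needs read access to the bucket).
sqladmin = discovery.build("sqladmin", "v1beta4")
operation = sqladmin.instances().import_(
    project=PROJECT,
    instance="my-cloudsql-instance",
    body={
        "importContext": {
            "fileType": "CSV",
            "uri": GCS_URI,
            "database": "analytics",
            "csvImportOptions": {"table": "monthly_extract"},
        }
    },
).execute()
# In practice, poll the returned operation until it completes before cleaning up.

# 3. Delete the residual file from Cloud Storage once the import has finished.
storage.Client(project=PROJECT).bucket(BUCKET).blob(OBJECT).delete()
```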
Am I correct in assuming (and I could be wrong here) that once I export data to standard SQL DB in GCP, the cost per query will be less in that exported database than it is in BigQuery?
As for the prices, you will find here how to estimate storage and query costs within BigQuery. As for other databases like Cloud SQL, here you will find more information about the Cloud SQL pricing.
Nonetheless, as Maxim pointed out, you can refer to both the best practices within BigQuery, in order to maximize efficiency and therefore minimize cost, and the best practices for using Cloud SQL.
Both can greatly help you minimize cost and be more efficient in your queries or imports.
I hope this helps.

Where the data will be stored by BigQuery

I am using BigQueryIO to publish data into BigQuery from a Google Dataflow job.
AFAIK, BigQuery can be used to query data from Google Cloud Storage, Google Drive and Google Sheets.
But when we store data using BigQueryIO, where will the data be stored? Is it in Google Cloud Storage?
Short answer - BigQueryIO writes to / reads from BigQuery tables, i.e. BigQuery's own managed storage.
To go a little deeper:
BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.
It manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.
You can read more about BigQuery different components in BigQuery Overview
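For reference, a minimal Apache Beam (Python SDK) sketch of writing with BigQueryIO; the project, dataset, table and schema are placeholders. The rows end up in BigQuery's managed storage, not in a GCS bucket you own (Dataflow may stage temporary files, but the final data lives in the table):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder destination table and schema.
TABLE = "my-project:my_dataset.events"
SCHEMA = "user_id:STRING,event_type:STRING,clicks:INTEGER"

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "CreateRows" >> beam.Create([
            {"user_id": "u1", "event_type": "click", "clicks": 3},
        ])
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            TABLE,
            schema=SCHEMA,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```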
Cloud Storage is a separate service from BigQuery. Internally, BigQuery manages its own storage.
So, if you save your data to Cloud Storage and then use the bq command to load a BigQuery table from a file in Cloud Storage, there are now 2 copies of the data.
Consequences include:
If you delete the Cloud Storage copy, the data will still be in BigQuery.
Fees include a price for each copy. I think that as of April 2017, long-term storage in BigQuery was around $0.01/GB, and Cloud Storage around $0.01-$0.026/GB depending on the storage class.
If the same data is in both GCS and BigQuery, you are paying twice. Whether it is worthwhile to keep a backup copy of the data is up to you.
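As a concrete illustration of that load step, a small sketch with the google-cloud-bigquery Python client (bucket, dataset and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Load a CSV file from Cloud Storage into a BigQuery table; afterwards the table
# holds its own copy of the data, so the GCS object can be kept or deleted independently.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://my-bucket/exports/data.csv",   # placeholder GCS object
    "my-project.my_dataset.my_table",    # placeholder destination table
    job_config=job_config,
).result()
```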
BigQuery is a managed data warehouse; simply put, it's a database.
So your data will be stored in BigQuery, and you can access it using SQL queries.