Storing data in Azure storage, blob or table? - sql

I have a web app with a database where the consumption data are stored in a SQL database. I want to consolidate the data older than 3 months into the SQL database and save the unconsolidated data to storage. The data will not be accessed often, because the consolidated info will be available in SQL; it is only needed in case something goes wrong. Is it better to use table or blob storage? Thanks for your advice.
The data will be accessed separately, based on which building they come from. For example, I have building A and someone wants to know the detailed consumption for a week or a day half a year ago; I will go to storage and get the data. The data in SQL are stored every 5 minutes.

You can use either blob storage or table storage for this purpose, but I am more inclined towards using table storage to store this data.
The reason is that you would want some kind of querying capability, which only table storage offers. With blob storage, you would need to download all the relevant data on the client side, parse that data to create some kind of collection and then query that collection. With table storage, you can execute server-side queries.
If you're going with table storage, my recommendation would be to use a date/time value (with date precision) as the PartitionKey. This will make searching for data by date/time much faster.
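A minimal sketch of that layout with the azure-data-tables Python SDK (the table name, connection string and entity shape here are illustrative assumptions, not part of the original answer):

```python
# Sketch only: table name, connection string and property names are placeholders.
from datetime import datetime, timezone
from azure.data.tables import TableClient  # pip install azure-data-tables

table = TableClient.from_connection_string(
    conn_str="<storage-connection-string>",
    table_name="ConsumptionArchive",
)

# Write one 5-minute reading: PartitionKey is the reading date (date precision),
# RowKey combines building and time so rows stay unique within the partition.
reading_time = datetime(2020, 6, 15, 10, 35, tzinfo=timezone.utc)
table.create_entity(entity={
    "PartitionKey": reading_time.strftime("%Y-%m-%d"),
    "RowKey": f"buildingA_{reading_time.strftime('%H%M')}",
    "Building": "A",
    "ConsumptionKwh": 12.4,
})

# Server-side query: all readings for building A on a given day.
for entity in table.query_entities(
    query_filter="PartitionKey eq @day and Building eq @building",
    parameters={"day": "2020-06-15", "building": "A"},
):
    print(entity["RowKey"], entity["ConsumptionKwh"])
```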
If you're going with blob storage, my recommendation would be to use a Cool Storage account for saving this data. Since you would rarely need this data, storing it in a Cool Storage account would be cheaper than in a regular storage account.
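If you do go the blob route, a hedged sketch with azure-storage-blob is below; the container, blob path and local file are placeholders, and the Cool tier can alternatively be set as the account's default access tier:

```python
# Sketch only: container, blob name and file path are placeholders.
from azure.storage.blob import BlobServiceClient, StandardBlobTier  # pip install azure-storage-blob

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="consumption-archive",
                               blob="building-a/2020-06.csv")

# Upload the exported raw readings and place the blob in the Cool access tier.
with open("building-a-2020-06.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True, standard_blob_tier=StandardBlobTier.Cool)
```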

Related

Load daily MySQL DB snapshots from S3 to Snowflake

I have daily MySQL DB snapshots stored on S3. Each daily snapshot is a backup of 1000 tables in our DB, taken with mysqldump; the size is about 300 MB daily (we keep 1 year of snapshots, which is about 110 GB).
Now we want to load these snapshots into Snowflake daily for reporting purposes. How do we create the tables in Snowflake? Shall we create 1000 tables? Will Snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now, you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce your metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics, or which will be accessed by BI tools (including direct SQL), I would hold in Snowflake - otherwise you won't get the performance that Snowflake can deliver and there is little point in using Snowflake - you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop the data from S3 (or archive it or do whatever you want to it).
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
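As a rough sketch of the external-table approach (assuming the snapshot data has been unloaded to S3 as CSV rather than raw mysqldump files; the stage, bucket, credentials and columns below are all placeholders), using the snowflake-connector-python package:

```python
# Sketch only: account details, stage/table names, bucket, credentials and columns
# are illustrative.  pip install snowflake-connector-python
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
cur = conn.cursor()

# Stage pointing at the S3 prefix holding the snapshot files.
cur.execute("""
    CREATE STAGE IF NOT EXISTS snapshot_stage
      URL = 's3://my-snapshot-bucket/daily/'
      CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
""")

# External table: the data stays in S3; each row is exposed as a VARIANT named VALUE,
# with virtual columns defined over it.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders_snapshot (
      order_id   NUMBER AS (VALUE:c1::NUMBER),
      order_date DATE   AS (VALUE:c2::DATE),
      amount     NUMBER AS (VALUE:c3::NUMBER)
    )
    LOCATION = @snapshot_stage/orders/
    FILE_FORMAT = (TYPE = CSV)
    AUTO_REFRESH = FALSE
""")

# The query runs in Snowflake but reads the files from S3.
cur.execute("SELECT COUNT(*) FROM orders_snapshot")
print(cur.fetchone())
```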
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc.
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.) - see the short Python sketch after this list
ETL/ELT tool - probably overkill for your specific use case but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as move your data around
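For the second option, a minimal Python sketch (connection parameters are placeholders, and it reuses the hypothetical orders_snapshot table from the earlier sketch) that computes a metric and materializes it as a regular Snowflake table:

```python
# Sketch only: connection details and table/column names are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Aggregate the source data and store the result as a normal (internal) table,
# which BI tools can then query with full Snowflake performance.
cur.execute("""
    CREATE OR REPLACE TABLE daily_order_totals AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders_snapshot
    GROUP BY order_date
""")
conn.close()
```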
Hope this helps?

AWS Glue: sync data from RDS (need to sync 4 tables from all schemas) to S3 (Apache Parquet format)

We are using a Postgres RDS instance (db.t3.2xlarge with around 2 TB of data). We have a multi-tenancy application, so for every organization that signs up for our product we create a separate schema which replicates our data model. Now a couple of our schemas (around 5 to 10 schemas) contain a couple of big tables (around 5 to 7 big tables, each containing 10 to 200 million rows). For the UI we need to show some statistics as well as graphs, and to calculate those statistics and graph data we need to perform joins on the big tables, which slows down our whole database server. Sometimes we have to run this type of query at night so that users don't face any performance issues. So as a solution we are planning to create a data lake in S3, so that we can shift all the analytical load out of the RDBMS and into an OLAP solution.
As a first step we need to transfer our data from RDS to S3 and also keep syncing both data sources. Can you please suggest which tool is a better choice for us considering the below requirements:
We need to update the last 3 days' data on an hourly basis. We want to keep updating recent data because over the 3-day time window it may change. After 3 days we can consider the data "at rest" and it can rest in the data lake without any further modification.
We are using a multi-tenancy system and currently have ~350 schemas, but this will increase as more organizations sign up for our product.
We are planning to do ETL, so in the transform step we plan to join all the tables, create one denormalized table and store the data in Apache Parquet format in S3, so that we can perform analytical queries on that table using Redshift Spectrum, EMR, or some other tool.
I only found out about AWS data lakes recently, but based on my research (which will hopefully help you towards the best possible solution):
AWS Athena can partition data, and you may want to partition your data based on tenant id (customer id).
AWS Glue has crawlers:
Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes.
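A hedged boto3 sketch of such a crawler (the crawler name, IAM role, database and S3 path are placeholders):

```python
# Sketch only: names, role ARN, region and S3 path are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="tenant-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://my-datalake/denormalized/"}]},
    # Run hourly so newly landed data and table definition changes are picked up.
    Schedule="cron(0 * * * ? *)",
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)
```

If the Parquet files are written under prefixes such as s3://my-datalake/denormalized/tenant_id=<id>/, the crawler registers tenant_id as a partition column, which is what lets Athena prune by tenant as suggested above.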

Archival solution for BigQuery tables in a data warehouse

Given a use case of building a data warehouse using BigQuery, say a monthly backup needs to be taken of all the BigQuery tables. What would be the best option?
Export all the table data to Cloud Storage (CSV?)
Copy all the tables to a different dataset (and maybe in a different project)
Which would be the best option, considering cost and maintenance? Please also share any other options.
When moving data from BigQuery to GCS, you are not charged for the export and load operations, as mentioned in the free operations documentation; however, you do incur charges for storing the data in GCS, which depend on the storage class selected. This service offers Multi-Regional, Regional, Nearline and Coldline classes that you can choose based on how frequently you need to access the stored data.
Based on this, if you want to make your backups and won't access the data frequently, you can store your data in GCS with Coldline/Nearline storage, or use long-term storage in BigQuery, which is applied automatically when a table is not edited for 90 consecutive days; those would be the cheaper options. On the other hand, if you plan to use your data actively, it may be better to keep it in BigQuery active storage, which will cost about the same as storing it in GCS with Regional storage; nevertheless, it will depend on your specific use cases and the way you want to interact with your data.
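For the export route, a hedged sketch with the google-cloud-bigquery Python client (project, dataset, table and bucket names are placeholders; the bucket is assumed to have been created with a Nearline or Coldline storage class):

```python
# Sketch only: project, dataset, table and bucket names are placeholders.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")

source_table = "my-project.my_dataset.my_table"
# The wildcard lets BigQuery shard the export into multiple files if the table is large.
destination_uri = "gs://my-coldline-backup-bucket/my_dataset/my_table/backup-*.avro"

job_config = bigquery.ExtractJobConfig(destination_format="AVRO")
extract_job = client.extract_table(source_table, destination_uri, job_config=job_config)
extract_job.result()  # wait for the export; the export operation itself is free
```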
Regarding the ingestion file format, BigQuery supports a variety of formats that you can use to load your data. I suggest you check this documentation, which can help you decide on the format that best fits your scenario based on your data structure.

Transactional data in data lake

We have multiple source systems sending data. Ideally we should capture the raw data coming from the sources and keep it in the data lake. Then we have to process the raw data into a structured format. Now users can update this data via a front-end application.
I am thinking of putting an RDBMS on top of the processed data, then pulling the audit trails from the RDBMS into the data lake and merging the processed data and audit trails to create the final view for reporting. Alternatively, the RDBMS could also be used for analytics.
Or we could land all the data in the RDBMS first, apply the changes there, and pull the data from the RDBMS into the data lake. But then it doesn't make much sense to bring in a data lake at all.
Kindly suggest.
Thanks,
ADLA (Azure Data Lake Analytics) is NOT consumer-oriented, meaning you would not connect a front-end system to it.
If the question is "what should we do", I'm not sure anyone can answer that for you, but it sounds like you are on the right track.
What I can do is tell you what we do:
Raw data (CSV or TXT files) comes into Blob Storage.
U-SQL scripts extract that data and store it in Data Lake Analytics tables. [Blobs can be deleted at that point].
We output processed data as required to "consumable" sources like an RDBMS. There are several ways to do this, but currently we output to pipe-delimited text files in blob storage and use PolyBase to import into SQL Server. YMMV.
Pulling the data into Data Lake first and RDBMS second makes sense to me.

Where will the data be stored by BigQuery?

I am using BigQueryIO to publish data into BigQuery from a Google Dataflow job.
AFAIK, BigQuery can be used to query data from Google Cloud Storage, Google Drive and Google Sheets.
But when we store data using BigQueryIO, where will the data be stored? Is it in Google Cloud Storage?
Short answer: BigQueryIO writes to / reads from a BigQuery table.
To go a little deeper:
BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.
It manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.
You can read more about BigQuery's different components in the BigQuery Overview.
Cloud Storage is a separate service from BigQuery. Internally, BigQuery manages its own storage.
So, if you save your data to Cloud Storage, and then use the bq command to load a BigQuery table from a file in Cloud Storage, there are now 2 copies of the data.
Consequences include:
If you delete the Cloud Storage copy, the data will still be in BigQuery.
Fees include a price for each copy. I think in April 2017 long-term storage in BQ is around $0.01/GB, and in Cloud Storage around $0.01-$0.026/GB depending on storage class.
If the same data is in both GCS and BQ, you are paying twice. Whether it is worthwhile to have a backup copy of data is up to you.
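A hedged sketch of that load step (the Python equivalent of bq load; bucket, dataset and table names are placeholders):

```python
# Sketch only: bucket, dataset and table names are placeholders. After this load,
# the CSV object in GCS and the BigQuery table are two independently billed copies;
# deleting the GCS object afterwards does not affect the BigQuery table.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/data.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```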
BigQuery is a managed data warehouse; simply put, it's a database.
So your data will be stored in BigQuery, and you can access it by using SQL queries.