Syncing Azure Blob Storage to Amazon S3

We're storing about 4 million miscellaneous files (4 TB or so), mainly Word and PDF documents, in Azure Blob storage. I'm looking to replicate this data in a different cloud for disaster recovery and peace of mind, and Amazon S3 seems as good a candidate as any.
Trouble is, I don't have a local server large enough to hold a local copy of these files. Ideally, I'd want to sync right from Azure Blob to S3. We're adding new files continually, so the sync would need to be frequent as well (multiple times per day).
I see lots of options for download from Azure to local => upload from local to S3, but very little for direct Azure => S3 sync. What are some good options here?

You can migrate the Azure Storage data to Amazon S3 with a Node.js package.
You can see the full description provided here.
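As a rough illustration of what such a package does under the hood, here is a minimal sketch that copies blobs from an Azure container straight into an S3 bucket using the official Node.js SDKs (@azure/storage-blob and @aws-sdk/client-s3). The container, bucket, and region names are placeholders, and a sync that runs several times per day would also need a change-tracking strategy rather than a full re-listing each run:

import { BlobServiceClient } from '@azure/storage-blob';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

// Placeholder names; substitute your own storage account, container, bucket, and region.
const blobService = BlobServiceClient.fromConnectionString(
  process.env.AZURE_STORAGE_CONNECTION_STRING!
);
const s3 = new S3Client({ region: 'us-east-1' });
const CONTAINER_NAME = 'documents';
const S3_BUCKET = 'my-dr-bucket';

async function syncContainerToS3(): Promise<void> {
  const container = blobService.getContainerClient(CONTAINER_NAME);

  // Walk every blob in the container. A sync that runs multiple times a day
  // would instead filter on properties.lastModified or keep a checkpoint,
  // rather than re-listing all 4 million objects each run.
  for await (const blob of container.listBlobsFlat()) {
    // Word/PDF files are small enough to buffer in memory; for large blobs
    // you would stream the body instead (e.g. with @aws-sdk/lib-storage).
    const data = await container.getBlobClient(blob.name).downloadToBuffer();

    await s3.send(
      new PutObjectCommand({
        Bucket: S3_BUCKET,
        Key: blob.name,
        Body: data,
      })
    );
    console.log(`Copied ${blob.name}`);
  }
}

syncContainerToS3().catch(console.error);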
You can also use Azure Data Factory to replicate the data, as it provides a Copy Data tool that can be configured to match your needs and transfer settings.
You can refer to this document on Azure Data Factory and the Copy Data tool.

Related

Google Cloud Data Fusion: How can I load many tables to BigQuery in one pipeline?

I want to load many tables from an AWS RDS MySQL server using Cloud Data Fusion. Each table is larger than about 1 GB. I found the "multiple database table" plugin for loading multiple tables, but it failed. Also, when I use the regular Database source I can check my tables' schema; however, with the multiple database table plugin I can't find how to check a table's schema. How can I use this plugin, or is there another way to load many tables in Data Fusion?
My pipeline settings were as follows.
I'm posting this as a Community Wiki answer, as the OP didn't provide enough details to reproduce the issue, but the information below might help someone.
There are a few ways to get your data using Cloud Data Fusion: you can use a pipeline, a plugin, a driver, and a few other approaches depending on your needs.
You can find two very well-described guides with examples online.
If you would like information about using Cloud Data Fusion with GCP products, you should read Bahadir Bulut's guide - How I used Google Cloud Data Fusion to create a data warehouse - Part 1 and Part 2. Data Fusion also offers 150+ preconfigured connectors and transformations, such as Amazon S3, SQS, Azure services, and many more.
Another well-described approach (which I guess would help the OP) is to configure both Amazon and GCP resources and use pipelines. This guide is Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery — Part 1: Setting up AWS Data Pipeline, and the second part is Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery — Part 2: Setting up BigQuery Transfer Service and Scheduled Query. In short, this guide describes two main steps:
Extract the data from MySQL RDS and bring it into S3 using the AWS Data Pipeline service.
From S3, load the files into BigQuery using the BigQuery Transfer Service.

Handling pictures, documents, etc. (Microsoft Azure)

I am currently in the process of building a SQL database in Microsoft Azure for handling pictures, documents, etc. What is the most efficient/best way of storing the data: uploading the files directly to the DB, or storing the files in something like Azure Blob Storage? I have read numerous posts about people uploading files directly to the DB, but I am concerned about its efficiency.
Thank you in advance for any replies.
You can store the files in something like Azure SQL DB, for example, but I would not recommend it; you should definitely store them in Azure Storage (Blob) and then keep a reference in a DB. Azure has multiple relational and NoSQL data stores which are offered as platform services.
I would do two things. First, use a NoSQL platform data store like Cosmos DB with the Core (SQL) API to store the metadata for the images; here you can use the filename as the partition key to do a point read (a very fast read and a very cheap option with blazing performance). Second, I would use Azure CDN to make sure images are accessed via CDN so that they are served faster. (A sketch of the metadata pattern follows at the end of this answer.)
Azure CDN has three options: Akamai, Verizon, and Microsoft. You can test which CDN is fastest from your location here: https://cloudharmony.com/speedtest-for-azure
You can also use the same URL to test which Azure region is closest to you so you can use that region, or test for your end users and choose the region closest to them.
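A minimal sketch of that metadata pattern with the @azure/cosmos Node.js SDK, assuming the container's partition key is /id and using hypothetical database, container, and environment variable names:

import { CosmosClient } from '@azure/cosmos';

// Hypothetical names; the container's partition key is assumed to be /id
// so that a filename lookup is a single point read.
const cosmos = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!);
const images = cosmos.database('media').container('imageMetadata');

interface ImageMetadata {
  id: string;          // the file name, which doubles as the partition key
  blobUrl: string;     // where the binary lives in Blob Storage / behind the CDN
  contentType: string;
  sizeInBytes: number;
}

// Store the metadata document after the file has been uploaded to Blob Storage.
export async function saveMetadata(meta: ImageMetadata): Promise<void> {
  await images.items.create(meta);
}

// Point read: id plus partition key, the cheapest and fastest Cosmos DB read.
export async function getMetadata(fileName: string): Promise<ImageMetadata | undefined> {
  const { resource } = await images.item(fileName, fileName).read<ImageMetadata>();
  return resource;
}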
I would say storing the files in Azure Blobs is the better idea. Imagine you have 100 GB of files stored in the DB:
It will slow down your queries if your table is not designed properly.
Backing up and restoring the DB will be very slow.
Azure SQL DB is more expensive than Azure Blob storage for the same size.
If your total file size is small enough, it doesn't make much difference.

Transfer a file from a computer to an Azure VM

I have a VB.NET application connected to a SQL Server. This application handles files.
Recently, the application was connected to a SQL Server running in an Azure VM.
My question is: how can I handle the files?
I want my application to upload the files (over the internet) somewhere, and then have the server side decide where these files will be saved, and the opposite for downloads.
Can you tell me what options I have? I don't want OneDrive.
Depending on the kinds of files you store and the way your application handles them, you have multiple options with Azure: Azure Blob Storage (with blob types: Block, Append, and Page), Azure Files, or Azure Data Lake Store.
Azure Blob Storage:
The following blob types are a good fit if your data is unstructured.
Block Blobs: for binary data or text; you store the data in blocks that can be managed individually (a sketch of a block blob upload follows at the end of this answer).
Page Blobs: for random-access files; good for storing the VHDs that back VMs.
Append Blobs: similar to block blobs, but append-only and optimized for append workloads; good for log file storage.
If you handle files using native file system APIs and want to "lift and shift" your application as is, Azure Files, which uses the SMB protocol, might be your best option.
Another option you might want to try, which is in preview (not generally available yet), is Azure Data Lake Storage Gen2, which allows you to interact with Azure Blob storage through a file system interface.
From the way you describe your application, I doubt you want to use the Azure Disks service. Here is a comparison table to help you decide: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks?toc=%2fazure%2fstorage%2fblobs%2ftoc.json
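As a concrete starting point for the block blob route, here is a minimal sketch of the upload/download round trip using the @azure/storage-blob Node.js SDK. The question's application is VB.NET, so treat this as the pattern rather than drop-in code; the container name and connection-string variable are assumptions:

import { BlobServiceClient } from '@azure/storage-blob';
import { createReadStream } from 'fs';
import { basename } from 'path';

// Hypothetical container and connection string; substitute your own.
const blobService = BlobServiceClient.fromConnectionString(
  process.env.AZURE_STORAGE_CONNECTION_STRING!
);
const container = blobService.getContainerClient('app-files');

// Upload a local file as a block blob and return its URL, which the server
// side can then record (for example in the SQL Server database).
export async function uploadFile(localPath: string): Promise<string> {
  const blockBlob = container.getBlockBlobClient(basename(localPath));
  await blockBlob.uploadStream(createReadStream(localPath));
  return blockBlob.url;
}

// Download a blob back to a local path (the "opposite" direction).
export async function downloadFile(blobName: string, destinationPath: string): Promise<void> {
  await container.getBlobClient(blobName).downloadToFile(destinationPath);
}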

Backup options or snapshots of Google Cloud Storage data?

I pull data into Google BigQuery tables and also generate some new datasets based on this data daily.
I save the original data and the generated datasets in Google Cloud Storage for two purposes:
They are the backup copy of my Google BigQuery data.
Some of the datasets saved in Google Cloud Storage are also bulk loaded into AWS Elasticsearch (so they are the backup copy for AWS Elasticsearch as well).
BigQuery and AWS Elasticsearch may only keep 2 months to 1 year of data, so for anything older than that I only have one copy, in Google Cloud Storage. (I need some backup option, such as one month of snapshots for Google Cloud Storage, that I can go back to if needed.)
My questions are:
How can I keep a backup or snapshot of Google Cloud Storage data to prevent data loss, such that I can trace back at least 7 days or 1 month of the data in Google Cloud Storage?
That way, in case of data loss (accidentally deleted data, etc.), I can go back a few days and get the data back.
Thanks!
You can back up your cloud data to some local storage; CloudBerry has a "Cloud to Local" option.
I can recommend the software I am using myself: CloudBerry Backup, which can back up cloud storage to local storage or to other cloud storage. The tool supports various cloud storage providers, e.g. Amazon, Google, Azure, etc. You can also download and upload data with the tool, so it's better to install it on a Google VM.

Synchronize Amazon RDS with Google BigQuery

The company where I work has some MySQL databases on AWS (Amazon RDS). We are doing a POC with BigQuery, and what I am researching now is how to replicate the databases to BigQuery (both the existing records and the new ones in the future). My questions are:
How do I replicate the MySQL tables and rows to BigQuery? Is there any tool to do that (I am reading about AWS Database Migration Service)? Should I replicate to Google Cloud SQL and then export to BigQuery?
How do I replicate future records? Is it possible to create a job inside MySQL to send new records after a predefined threshold? For example, after 1,000 new rows are inserted (or some time has passed), an event is "triggered" and the new records are copied to Cloud SQL/BigQuery?
My initial idea is to dump the original database, load it into the other, and use a script to listen for new records and send them to the new database.
Have I explained it properly? Is it understandable?
You will need to use one of the ETL tools that integrate with both MySQL and BigQuery to perform the initial transfer of the data and copy subsequent changes to BigQuery. Take a look at the list of available tools [1].
You can also implement your own tool by developing a process that extracts the data from MySQL to a CSV file and then loads that file into BigQuery using data import [2].
[1] https://cloud.google.com/bigquery/third-party-tools
[2] https://cloud.google.com/bigquery/loading-data-into-bigquery
In addition to what Vadim said, you can try:
mysqldump to CSV files and upload them to S3 (I believe RDS allows that)
run the "gsutil" Google Cloud Storage utility to copy the data from S3 to GCS
run "bq load file.csv" to load the file into BigQuery (a sketch of this last step with the client library follows after this answer)
I'm interested in hearing your experience, so feel free to ping me in private.
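As a sketch of that last load step using the @google-cloud/bigquery Node.js client instead of the bq CLI, assuming the CSV has already been copied from S3 into a GCS bucket (the bucket, object, dataset, and table names are placeholders):

import { BigQuery } from '@google-cloud/bigquery';
import { Storage } from '@google-cloud/storage';

const bigquery = new BigQuery();
const storage = new Storage();

// Placeholder names; the CSV is assumed to already be in GCS (the gsutil step above).
const BUCKET = 'my-gcs-bucket';
const OBJECT = 'exports/my_table.csv';
const DATASET = 'my_dataset';
const TABLE = 'my_table';

async function loadCsvIntoBigQuery(): Promise<void> {
  const [job] = await bigquery
    .dataset(DATASET)
    .table(TABLE)
    .load(storage.bucket(BUCKET).file(OBJECT), {
      sourceFormat: 'CSV',
      skipLeadingRows: 1,      // skip the header row
      autodetect: true,        // infer the schema from the file
      writeDisposition: 'WRITE_APPEND',
    });
  console.log(`Load job ${job.id} finished`);
}

loadCsvIntoBigQuery().catch(console.error);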