How to copy many terabytes of data to Azure? - azure-storage

I am trying to copy 25 TB of data to Azure. Is there any option for moving the data?
I tried copying it, but it took 1 hour for 1 GB of data. Is there a better solution so that I can do it more quickly?

The problem statement is very general. I would start by asking: how are you transferring the data?
The speed depends on many factors, a few being:
1. Location of the data.
2. Location of the storage account you're writing to.
3. Network speed and bandwidth on the client side.
4. Network speed and bandwidth on the Azure Storage side (expected to be good).
If you're writing the data to an Azure Storage account in a region closer to you, you can expect better speed.
As for the options to write the data:
1. Look at AzCopy.
https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/
2. Use the Import/Export service.
https://azure.microsoft.com/en-us/pricing/details/storage-import-export/
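As a rough sanity check on the numbers in the question, the sketch below estimates how long 25 TB takes at a given sustained throughput. It assumes decimal units (1 TB = 1000 GB) and a perfectly steady link; the function names and the 1 Gbit/s comparison figure are illustrative assumptions, not measurements.

```python
def transfer_days(size_tb: float, throughput_gb_per_hour: float) -> float:
    """Days needed to move size_tb terabytes at a steady throughput (decimal units)."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    hours = size_tb * 1000 / throughput_gb_per_hour
    return hours / 24


def gbit_per_s_to_gb_per_hour(gbit_per_s: float) -> float:
    """Convert a line rate in Gbit/s to GB/hour (8 bits per byte, 3600 s per hour)."""
    return gbit_per_s / 8 * 3600


# The rate reported in the question: 1 GB per hour.
observed = transfer_days(25, 1)
# Hypothetical comparison: a fully saturated 1 Gbit/s uplink.
fast_link = transfer_days(25, gbit_per_s_to_gb_per_hour(1))

print(f"at 1 GB/hour: {observed:.0f} days")
print(f"at 1 Gbit/s:  {fast_link:.1f} days")
```

At the observed rate the copy needs about 25,000 hours, so either the throughput has to improve by orders of magnitude or the data has to travel on physical disks.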

The best way to upload large datasets into the cloud is still the sneakernet.
Azure offers the Azure Import/Export Service. Basically, you buy a SATA hard drive, encrypt it with a numerical BitLocker key, copy your data to it, create an Azure import job, then ship the hard drive to them.
This ends up being considerably quicker than trying to upload.
An alternative you might want to look into is AWS Import/Export Snowball: they ship you an appliance to copy the data to, and you ship it back to them when the copy is complete. It might be worth copying the data into AWS via Snowball and then copying it across their much faster internet pipes into Azure, instead of buying the hardware required to transfer that much data.

If you open the target Storage account in the Azure Portal, there's now a calculator that will accept basic details (how much data, etc.) and then recommend the best options to you. It's under the heading "Data transfer".

Related

How to increase queries per minute of Google Cloud SQL?

As in the question, I want to increase the number of queries per second on GCS. Currently, my application runs on my local machine; when it runs, it repeatedly sends queries to and receives data back from the GCS server. More specifically, I am located in Vietnam, and the server (free tier, though) is in Singapore. The maximum QPS I can get is ~80, which is unacceptable. I know I can get better QPS by putting my application in the cloud, in the same location as the SQL server, but that alone requires a lot of configuration and work. Are there any solutions for this?
Thank you in advance.
Colocating your application front-end layer with the data persistence layer should be your priority: deploy your code to the cloud as well.
Use persistent connections/connection pooling to cut down on connection establishment overhead.
Free tier instances for Cloud SQL do not exist. What are you referring to here? f1-micro GCE instances are not free in the Singapore region either.
Depending on the complexity of your queries, read/write pattern, size of your dataset, etc., the performance of your DB could be I/O bound. Ensuring your instance is provisioned with SSD storage and/or increasing the data disk size can help lift IOPS limits, further improving DB performance.
Side note: don't confuse the commonly used abbreviation GCS (Google Cloud Storage) with Google Cloud SQL.
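To see why colocation and pooling matter, here is a back-of-the-envelope sketch. It assumes each query costs one network round trip plus some server time, and that each connection issues its queries one after another. The latency figures are hypothetical, chosen only so that the single-connection case lands near the ~80 QPS mentioned in the question.

```python
def max_qps(round_trip_ms: float, server_ms: float, connections: int = 1) -> float:
    """Upper bound on queries/second when each connection runs its queries serially."""
    per_query_ms = round_trip_ms + server_ms
    if per_query_ms <= 0 or connections < 1:
        raise ValueError("latency must be positive and connections >= 1")
    return connections * 1000 / per_query_ms


# Hypothetical latencies: 12 ms cross-region round trip, 0.5 ms of server work.
remote_single = max_qps(round_trip_ms=12, server_ms=0.5)
remote_pooled = max_qps(round_trip_ms=12, server_ms=0.5, connections=8)
# Hypothetical 1 ms round trip once the application runs next to the database.
colocated_single = max_qps(round_trip_ms=1, server_ms=0.5)

print(f"remote, 1 connection:     {remote_single:.0f} QPS")
print(f"remote, 8 connections:    {remote_pooled:.0f} QPS")
print(f"colocated, 1 connection:  {colocated_single:.0f} QPS")
```

Under these assumptions the round trip dominates the cost of every query, so both running queries in parallel over pooled connections and moving the application closer raise the ceiling.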

Azure Blob Storage effectiveness

I'm creating an application that is going to involve a lot of pictures.
I am currently using Windows Azure Blob Storage. I know you're not supposed to store pictures in the database because they take up so much space; instead, you just store the address and put the files on disk somewhere on the server.
So I'm wondering if I'm heading in the right direction by using Azure Blob Storage.
How will the speed be? Would it be costly?
How hard would it be to migrate later on so I can store the files on a disk?
Please advise,
Thanks
That is precisely one of the main usage scenarios for the Azure blobs. There may be scenarios where something else is better, but for most cases that is what you are looking for.
Note that, depending on how it will be used, you may want to enable the CDN service to make it perform best for users around the globe (if each image will be viewed many times).
If you later decide to move the files out of blob storage, you can use a tool such as CloudBerry, or just write a few lines of code. The main thing, as usual, is the code you put in the application for it; if it is well structured, it should be quick to migrate as well.
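One way to keep that migration cheap is to have the application talk to a small storage interface rather than to blob storage directly. The sketch below is only an illustration: `ImageStore`, `DiskImageStore`, `InMemoryImageStore`, and `migrate` are hypothetical names, and the in-memory class merely stands in for a blob-backed implementation that would call the Azure SDK.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Protocol


class ImageStore(Protocol):
    """Hypothetical interface the application codes against."""

    def save(self, name: str, data: bytes) -> None: ...
    def load(self, name: str) -> bytes: ...


class DiskImageStore:
    """Stores images as files under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, data: bytes) -> None:
        (self.root / name).write_bytes(data)

    def load(self, name: str) -> bytes:
        return (self.root / name).read_bytes()


class InMemoryImageStore:
    """Stand-in for a blob-backed store; a real one would call the storage SDK."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def save(self, name: str, data: bytes) -> None:
        self._blobs[name] = data

    def load(self, name: str) -> bytes:
        return self._blobs[name]

    def names(self) -> List[str]:
        return sorted(self._blobs)


def migrate(names: List[str], source: ImageStore, target: ImageStore) -> int:
    """Copy the named images from one store to another; returns the count copied."""
    for name in names:
        target.save(name, source.load(name))
    return len(names)


blob_like = InMemoryImageStore()
blob_like.save("cat.jpg", b"\xff\xd8fake-jpeg-bytes")
disk = DiskImageStore(Path(tempfile.mkdtemp()))
copied = migrate(blob_like.names(), blob_like, disk)
print(copied, disk.load("cat.jpg") == blob_like.load("cat.jpg"))
```

With this shape, moving from blobs to disk is a matter of running `migrate` once and swapping which store the application is constructed with.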

S3 or EBS for storing data in flat files

I have flat files in which I store and retrieve data instead of using a database. This is temporary and may last for a couple of months. I was wondering if I should be using EBS or S3. EBS is mainly used for I/O and S3 for content delivery, but S3 is on a pay-as-you-go model, whereas with EBS you have to pay for the volume purchased?
Please guide me: which one is better?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or you access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
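The billing difference described above (pay for what you store versus pay for the whole volume) can be sketched with simple arithmetic. The prices below are placeholders for illustration only, not actual AWS prices, and the sketch ignores request and data-transfer charges.

```python
def used_capacity_cost(stored_gb: float, price_per_gb_month: float) -> float:
    """Object-storage style billing: pay only for the data actually stored."""
    return stored_gb * price_per_gb_month


def provisioned_cost(volume_gb: float, price_per_gb_month: float) -> float:
    """Block-storage style billing: pay for the whole volume, full or not."""
    return volume_gb * price_per_gb_month


# Placeholder prices, NOT real AWS pricing; check the current price list.
OBJECT_PRICE = 0.03  # hypothetical $/GB-month
BLOCK_PRICE = 0.10   # hypothetical $/GB-month

stored_gb = 40       # data actually written so far
volume_gb = 200      # volume size chosen up front

print(f"object storage: ${used_capacity_cost(stored_gb, OBJECT_PRICE):.2f}/month")
print(f"block storage:  ${provisioned_cost(volume_gb, BLOCK_PRICE):.2f}/month")
```

The point is the shape of the bill rather than the numbers: the object-storage cost tracks the data you hold, while the block-storage cost is fixed by the volume size you picked.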
To build on the good answer from willglynn: if you are interacting with the data regularly, or need more file-system-like access, you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like ElastiCache for in-memory storage, which would likely be superior performance-wise to using S3 or EBS.
Similarly, you might look at DynamoDB for document-type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use either S3 or EBS for this. If you want reduced latency and the file sizes are bigger, then EBS is the better option.
Point 2) If you want lower costs, then S3 is a better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS.
Pros to EBS:
1. Handy to have the data on a file system in EC2. That lets you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.

AWS S3 standard vs. reduced redundancy storage?

Does anyone know of any real-world analysis of data loss using these two AWS S3 storage options? I know from the AWS docs (via Quora) that one is 99.999999999% guaranteed and the other is only 99.99% guaranteed, but I'm looking for data from a non-AWS source.
Anecdotes or something more thorough would both be great. I apologize if this isn't the right SE site for this question. Feel free to suggest a place to migrate it.
I guess whether you really need the 99.999999999% level of durability depends on the data you're storing …
If you keep copies of your data locally and are just using S3 as a convenient place to store data that is actively being accessed by services within the AWS infrastructure, RRS might be the right choice for you :)
In my case, I keep fresh files on the normal durability level until I have created a local backup and then move them to RRS, which saves quite a bit of money.
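For a sense of scale, the two durability figures can be turned into expected losses per year. This is only arithmetic on the design targets quoted in this thread, treating object losses as independent; it is not observed loss data, so it does not answer the request for real-world numbers.

```python
def expected_annual_losses(object_count: int, annual_loss_rate: float) -> float:
    """Expected number of objects lost per year, treating losses as independent."""
    return object_count * annual_loss_rate


# Annual loss rates implied by the durability figures quoted in the thread.
STANDARD_LOSS_RATE = 1e-11  # 99.999999999% durability
RRS_LOSS_RATE = 1e-4        # 99.99% durability

objects = 10_000
standard = expected_annual_losses(objects, STANDARD_LOSS_RATE)
rrs = expected_annual_losses(objects, RRS_LOSS_RATE)

print(f"standard: {standard:.1e} objects/year (one loss per {1 / standard:,.0f} years)")
print(f"RRS:      {rrs:.1f} objects/year")
```

With 10,000 objects, the RRS figure works out to roughly one lost object per year on average, which is why keeping a local copy before moving files to RRS matters.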

Planning the development of a scalable web application

We have created a product that will potentially generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient; it has not been a problem so far, but we want to move to a more scalable system, so we're looking into EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and providing low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we make a copy over to S3 using CloudFront. The idea here is to get data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight into how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/ec2/
Cassandra: http://cassandra.apache.org/
Cassandra is a non-relational database engine, and if this is what you need, you should first evaluate Amazon's SimpleDB, a non-relational database engine built on top of S3.
If the file only needs to be updated based on time (daily, hourly, ...) then this seems like a reasonable solution. But you may consider placing a load balancer in front of 2 EC2 images, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
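The approach in the question (regenerate the file when an admin changes something, then serve the stored copy) can be sketched as follows. This is a minimal illustration under stated assumptions: `FilePublisher` is a hypothetical class, a plain dict stands in for the S3 bucket, and a list stands in for the database. A real version would upload with an AWS SDK instead.

```python
import hashlib
from typing import Callable, Dict, List


class FilePublisher:
    """Regenerates a data file and uploads it only when its content has changed."""

    def __init__(
        self,
        generate: Callable[[], bytes],
        upload: Callable[[str, bytes], None],
    ) -> None:
        self._generate = generate
        self._upload = upload
        self._digests: Dict[str, str] = {}  # last uploaded content hash per key

    def publish(self, key: str) -> bool:
        """Return True if a new version was uploaded, False if nothing changed."""
        data = self._generate()
        digest = hashlib.sha256(data).hexdigest()
        if self._digests.get(key) == digest:
            return False
        self._upload(key, data)
        self._digests[key] = digest
        return True


# Stand-ins: a dict plays the role of the bucket, a list plays the database.
bucket: Dict[str, bytes] = {}
rows: List[str] = ["alpha", "beta"]


def upload_to_bucket(key: str, data: bytes) -> None:
    bucket[key] = data


publisher = FilePublisher(
    generate=lambda: "\n".join(rows).encode("utf-8"),
    upload=upload_to_bucket,
)

first = publisher.publish("data.txt")   # first run: uploads
second = publisher.publish("data.txt")  # content unchanged: skipped
rows.append("gamma")                    # simulated admin change
third = publisher.publish("data.txt")   # content changed: uploads again
print(first, second, third)
```

User requests then only ever read the stored copy, so the per-request compute cost no longer depends on the database query.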
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazon's load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems, in your DA (distributed architecture). For example if you wanted the systems that create the data file to be different than the ones hosting the site.
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic
Make sure to have a good backup process with EC2: snapshot your OS drive often and place any volatile data (e.g. database files) on an EBS volume. EC2 doesn't fail often, but when it does you don't have access to the hardware, and if you have an up-to-date snapshot you can just kick a new instance online.
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structure used in NoSQL solutions that may help you decide whether this is an appropriate solution:
WTF is a Super Column