Need suggestion on usage of cloud file storage system, where reads are more than writes - amazon-s3

I need a cloud service for saving file objects for my project.
Requirements:
1. Pushing the files would be less compared to reading files from the storage.
2. Need to maintain versions of the file objects
3. File objects must be indexed for fast retrieval.
Note:
We have already considered using Amazon S3 bucket, but considering our project requirement, as reading the file objects would be 1 million times more than writing a file into storage.
As Amazon charges S3 usage more on the number of reads that writes it really is the last option for us to use.
Can anybody kindly provide suggestions on what can we use here?
Thanks!

Related

How to load large text file in Google BigQuery

I was going through Google BigQuery documentation and I see there is a limit of 5TB file capacity for unencrypted file load and 4TB for the encrypted file load in BigQuery, with 15TB per load job.
I have a hypothetical question - How can I load a text file larger than 16TB (assuming encryption will bring it in the range of 4TB)? I also see the GCS Cloud storage limit is 5TB per file.
I have never done it but here is how I think of possible approach but not sure and looking for confirmation. First, we will have to split the file. Next, we have to encrypt them and transfer them to GCS. Next, load them in the Google BigQuery table.
You are on the right track I guess. Split the file into smaller chunks, then go ahead and distribute them within 2 or 3 different GCS buckets.
Once the chunks are there in the buckets, you can go ahead and load them into BQ.
Hope it helps.

Lambda triggers high s3 costs

I created a new Lambda based on a 2MB zip file (it has a heavy dependency). After that, my S3 costs really increased (from $12.27 to $31).
Question 1: As this is uploaded from a CI/CD pipeline, could it be that it's storing every version and then increasing costs?
Question 2: Is this storage alternative more expensive than choosing directly an owned s3 bucket instead of the private one owned by Amazon where this zip goes? Looking at the S3 prices list, only 2MB can't result in 19 Dollars.
Thanks!
few things you can do to mitigate cost:
Use Lambda Layers for dependencies
Use S3 Infrequent access for your lambda archive
Being that I don't have your full configuration of S3, its hard to tell what can be causing cost...things like S3 versioning would do it.
The reason was that Object versioning was enabled, and after some stress tests those versions were accumulated and stored. Costs went back to $12 after they were removed.
It's key to keep the "Show" enabled (see image) to keep track of those files.

any storage service like amazon s3 which allows upload /Download at the same time on large file

My requirement to upload large file (35gb), when the upload is in progress need to start the download process on the same file. Any storage service which allows develop .net application
Because Amazon s3 will not allow simultaneously upload and download on
You could use Microsoft Azure Storage Page or Append Blobs to solve this:
1) Begin uploading the large data
2) Concurrently download small ranges of data (no greater than 4MB so the client library can read it in one chunk) that have already been written to.
Page Blobs need to be 512 byte aligned and can be read and written to in a random access pattern, whereas AppendBlobs need to be written to sequentially in an append-only pattern.
As long as you're reading sections that have already been written to you should have no problems. Check out the Blob Getting Started Doc: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/ and some info about Blob Types: https://msdn.microsoft.com/library/azure/ee691964.aspx
And feel free to contact us with any follow up questions.

S3 or EBS for storing data in flat files

I have flat files in which I store data and retrieve it instead of storing to database. This is temporary and may last for couple of months.I was wondering If I should be using EBS or S3. EBS is mainly used for I/O , S3 for content delivery , but S3 is on use you go model and EBS is you have to pay for the volume purchased ?
Pls guide, which one is better ?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll get return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or you access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
To build on the good answer from willglynn. If you are interacting with the data regularly, or need more file-system-like access you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like elasticache for in-memory storage which would likely be superior performance-wise then using s3 or EBS.
Similarly, you might look at DynamoDb for document type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use both S3 and EBS for this option. If you want reduced latency and file sizes are bigger then EBS is better option.
Point 2) If you want lower costs, then S3 is a better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS
Pros to EBS
1. Handy to have the data on a file system in EC2. That let you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.

Moving 1 million image files to Amazon S3

I run an image sharing website that has over 1 million images (~150GB). I'm currently storing these on a hard drive in my dedicated server, but I'm quickly running out of space, so I'd like to move them to Amazon S3.
I've tried doing an RSYNC and it took RSYNC over a day just to scan and create the list of image files. After another day of transferring, it was only 7% complete and had slowed my server down to a crawl, so I had to cancel.
Is there a better way to do this, such as GZIP them to another local hard drive and then transfer / unzip that single file?
I'm also wondering whether it makes sense to store these files in multiple subdirectories or is it fine to have all million+ files in the same directory?
One option might be to perform the migration in a lazy fashion.
All new images go to Amazon S3.
Any requests for images not yet on Amazon trigger a migration of that one image to Amazon S3. (queue it up)
This should fairly quickly get all recent or commonly fetched images moved over to Amazon and will thus reduce the load on your server. You can then add another task that migrates the others over slowly whenever the server is least busy.
Given that the files do not exist (yet) on S3, sending them as an archive file should be quicker than using a synchronization protocol.
However, compressing the archive won't help much (if at all) for image files, assuming that the image files are already stored in a compressed format such as JPEG.
Transmitting ~150 Gbytes of data is going to consume a lot of network bandwidth for a long time. This will be the same if you try to use HTTP or FTP instead of RSYNC to do the transfer. An offline transfer would be better if possible; e.g. sending a hard disc, or a set of tapes or DVDs.
Putting a million files into one flat directory is a bad idea from a performance perspective. while some file systems would cope with this fairly well with O(logN) filename lookup times, others do not with O(N) filename lookup. Multiply that by N to access all files in a directory. An additional problem is that utilities that need to access files in order of file names may slow down significantly if they need to sort a million file names. (This may partly explain why rsync took 1 day to do the indexing.)
Putting all of your image files in one directory is a bad idea from a management perspective; e.g. for doing backups, archiving stuff, moving stuff around, expanding to multiple discs or file systems, etc.
One option you could use instead of transferring the files over the network is to put them on a harddrive and ship it to amazon's import/export service. You don't have to worry about saturating your server's network connection etc.