Uploading a huge file to S3 (larger than my hard drive) - amazon-s3

I'm trying to upload a file to an S3 bucket, but the problem is that the file I want to upload (and also create) is bigger than my hard drive can hold (I want to store a 500 TB file in the bucket).
Is there any way to do so?
The file is generated, so I thought about generating it on the fly while it uploads, but I can't quite figure out how to do it.
Any help is appreciated :)
Thanks in advance

The Multipart Upload API allows you to upload a file in chunks, including on-the-fly content generation... but the maximum size of an object in S3 is 5 TiB.
Also, it costs a minimum of $11,500 to store 500 TiB in S3 for 1 month (at roughly $0.023 per GB-month for S3 Standard), not to mention the time it takes to upload it... but if this is a justifiable use case, you might consider using some Snowball Edge devices, each of which has its own built-in 100 TiB of storage.
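Here is a minimal sketch of that approach using the low-level multipart calls in the AWS SDK for .NET (the same operations exist in every SDK). The bucket and key are placeholders, GeneratePart stands in for whatever produces each chunk, every part except the last must be at least 5 MiB, and a single upload can have at most 10,000 parts:
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

class GenerateAndUpload
{
    // Placeholder: produce the next chunk of generated content (>= 5 MiB except the last),
    // or null when generation is finished.
    static byte[] GeneratePart(int partNumber) => throw new NotImplementedException();

    static async Task Main()
    {
        var s3 = new AmazonS3Client();
        const string bucket = "my-bucket";                   // placeholder
        const string key = "generated/huge-object.bin";      // placeholder

        var init = await s3.InitiateMultipartUploadAsync(new InitiateMultipartUploadRequest
        {
            BucketName = bucket,
            Key = key
        });

        var partETags = new List<PartETag>();
        int partNumber = 1;                                  // part numbers run from 1 to 10,000
        for (byte[] data = GeneratePart(partNumber); data != null; data = GeneratePart(++partNumber))
        {
            using var partStream = new MemoryStream(data);   // only one part is ever held in memory
            var part = await s3.UploadPartAsync(new UploadPartRequest
            {
                BucketName = bucket,
                Key = key,
                UploadId = init.UploadId,
                PartNumber = partNumber,
                InputStream = partStream,
                PartSize = data.Length
            });
            partETags.Add(new PartETag(partNumber, part.ETag));
        }

        await s3.CompleteMultipartUploadAsync(new CompleteMultipartUploadRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = init.UploadId,
            PartETags = partETags
        });
    }
}
If anything fails partway through, call AbortMultipartUpload so the parts already uploaded don't keep accruing storage charges.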

Related

Upload a >4 MB file to an Azure File Share in 2023

I need to upload files larger than 4 MB to an Azure File Share.
Previously the guidance was to use the Data Movement Library. The GitHub page implies it's being abandoned / no longer worked on and that the v12 libraries should be used instead, but it looks like the 4 MB limit is still in place (see Azure Storage File Shares client library for .NET).
What is the current way to upload files >4 MB to a file share?
The maximum size of a file that can be uploaded to an Azure File Share is 4 TiB (Reference).
When you upload a file to a File Share, you have to upload it in chunks, and the maximum size of each chunk is 4 MiB. I think this is where you are getting confused.
So, to upload a file larger than 4 MiB, what you would need to do is create an empty file using the ShareFileClient.CreateAsync method and specify the size of the file there.
Once that is done, you would need to read the source file in chunks (max chunk size 4 MiB) and call the ShareFileClient.UploadAsync method, passing the stream data read from the source file. A sketch of this flow follows.
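Here is a rough sketch of that flow with the Azure.Storage.Files.Shares (v12) library; the connection string, share, directory, and file names are placeholders, and the range-based ShareFileClient.UploadRangeAsync overload is used so that each chunk is written at the correct offset:
using System.IO;
using System.Threading.Tasks;
using Azure;                                 // HttpRange
using Azure.Storage.Files.Shares;

class LargeFileShareUpload
{
    static async Task Main()
    {
        const string connectionString = "<storage-connection-string>";      // placeholder
        var share = new ShareClient(connectionString, "my-share");          // placeholder share name
        var directory = share.GetDirectoryClient("uploads");                // placeholder directory
        ShareFileClient file = directory.GetFileClient("big-file.dat");     // placeholder file name

        await share.CreateIfNotExistsAsync();
        await directory.CreateIfNotExistsAsync();

        using FileStream source = File.OpenRead(@"C:\temp\big-file.dat");   // placeholder local path

        // Create the (empty) file with its final size first.
        await file.CreateAsync(source.Length);

        const int chunkSize = 4 * 1024 * 1024;   // 4 MiB, the maximum size of a single range
        var buffer = new byte[chunkSize];
        long offset = 0;
        int read;
        while ((read = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            using var chunk = new MemoryStream(buffer, 0, read, writable: false);
            await file.UploadRangeAsync(new HttpRange(offset, read), chunk);
            offset += read;
        }
    }
}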

Having trouble uploading larger files in BigQuery

I started the last course, the Case Study. I am doing Case 1 from the Google Data Analytics Certificate with BigQuery SQL, but I am struggling to upload the 202008 file because it is too large: "Local uploads are limited to 100 MB. Please use Google Cloud Storage for larger files."
Then I saw a video saying I could reduce the size by saving the Excel file as an Excel Binary Workbook (*.xlsb), but it still did not work. Did anyone face similar problems with Case 1 when uploading data? The error appears when I change the file from CSV to binary, which reduced the size from 93,213 to about half of that.
Any help will be appreciated
Basically, you first need to upload the files to GCS, preferably using gsutil, which supports resumable and parallel composite uploads for large files; then you can load them into BigQuery.
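If you prefer to do it from code rather than gsutil and the console, here is a hedged sketch of the same two-step flow using the Google.Cloud.Storage.V1 and Google.Cloud.BigQuery.V2 client libraries; the project, bucket, dataset, table, and file names are all placeholders:
using Google.Cloud.BigQuery.V2;
using Google.Cloud.Storage.V1;

class CsvToBigQuery
{
    static void Main()
    {
        const string projectId = "my-project";                        // placeholder
        const string bucket = "my-staging-bucket";                    // placeholder
        const string objectName = "trips/202008-tripdata.csv";        // placeholder

        // Step 1: upload the CSV to Cloud Storage.
        var storage = StorageClient.Create();
        using (var stream = System.IO.File.OpenRead(@"C:\data\202008-tripdata.csv")) // placeholder path
        {
            storage.UploadObject(bucket, objectName, "text/csv", stream);
        }

        // Step 2: load it from GCS into BigQuery.
        var bigquery = BigQueryClient.Create(projectId);
        var destination = bigquery.GetTableReference("trips_dataset", "trips_202008"); // placeholders
        var job = bigquery.CreateLoadJob(
            $"gs://{bucket}/{objectName}",
            destination,
            schema: null,                          // let BigQuery autodetect the schema
            options: new CreateLoadJobOptions
            {
                SourceFormat = FileFormat.Csv,
                SkipLeadingRows = 1,
                Autodetect = true
            });
        job.PollUntilCompleted();                  // wait for the load to finish
    }
}
The load job reads straight from GCS, so the 100 MB local-upload limit no longer applies.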

Simple way to load new files only into Redshift from S3?

The documentation for the Redshift COPY command specifies two ways to choose files to load from S3, you either provide a base path and it loads all the files under that path, or you specify a manifest file with specific files to load.
However in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.
Given that there is a table stl_file_scan that logs all the files that have been loaded from S3, it would be nice to somehow exclude those that have successfully been loaded. This seems like a fairly obvious feature, but I can't find anything in the docs or online about how to do this.
Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data -- new and old -- to a staging table, and then comparing/upserting to the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.
I know we could probably move the files that have already been loaded out of the bucket; however, we can't do that because this bucket is the final storage place for another process which is not our own.
The only alternative I can think of is to have some other process running that tracks files that have been successfully loaded to Redshift, periodically compares that to the S3 bucket to determine the differences, and then writes the manifest file somewhere before triggering the copy process. But what a pain! We'd need a separate EC2 instance to run the process, which would have its own management and operational overhead.
There must be a better way!
This is how I solved the problem:
S3 (event trigger on newly created objects) -> Lambda -> Firehose -> Redshift
It works at any scale: with more load there are more calls to Lambda and more data flowing into Firehose, and everything is taken care of automatically.
If there are issues with the format of a file, you can configure a dead-letter queue; events will be sent there and you can reprocess them once you fix the Lambda.
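For illustration, a stripped-down C# sketch of the Lambda piece (the delivery stream name is a placeholder, batching and error handling are omitted, and Firehose is assumed to be configured with a Redshift destination so it buffers the records and issues the COPY for you):
using System.Text;
using System.Threading.Tasks;
using Amazon.KinesisFirehose;
using Amazon.KinesisFirehose.Model;
using Amazon.Lambda.Core;
using Amazon.Lambda.S3Events;
using Amazon.S3;

[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

public class NewFileToFirehose
{
    private readonly IAmazonS3 _s3 = new AmazonS3Client();
    private readonly IAmazonKinesisFirehose _firehose = new AmazonKinesisFirehoseClient();

    // Triggered by S3 "object created" events; pushes each new object's rows into Firehose.
    public async Task FunctionHandler(S3Event s3Event, ILambdaContext context)
    {
        foreach (var record in s3Event.Records)
        {
            var bucket = record.S3.Bucket.Name;
            var key = record.S3.Object.Key;

            using var response = await _s3.GetObjectAsync(bucket, key);
            using var reader = new System.IO.StreamReader(response.ResponseStream);

            string line;
            while ((line = await reader.ReadLineAsync()) != null)
            {
                await _firehose.PutRecordAsync(new PutRecordRequest
                {
                    DeliveryStreamName = "redshift-delivery-stream",   // placeholder
                    Record = new Record
                    {
                        Data = new System.IO.MemoryStream(Encoding.UTF8.GetBytes(line + "\n"))
                    }
                });
            }
        }
    }
}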
Here are the steps I follow to load data into Redshift:
1. Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during export).
2. Split the files into 10-15 MB each to get optimal performance during upload and the final data load.
3. Compress the files to *.gz format so you don't end up with a $1000 surprise bill :) .. in my case text files compressed 10-20 times.
4. List all file names in a manifest file so that when you issue the COPY command to Redshift it's treated as one unit of load (a sketch follows below).
5. Upload the manifest file to the Amazon S3 bucket.
6. Upload the local *.gz files to the Amazon S3 bucket.
7. Issue the Redshift COPY command with the appropriate options.
8. Schedule file archiving from on-premises and the S3 staging area on AWS.
9. Capture errors and set up restartability if something fails.
For an easier way, you can follow this link.
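For the manifest and COPY steps (4, 5, and 7 above), here is a hedged C# sketch; the bucket, keys, cluster, database, table, and IAM role names are placeholders, and the COPY is issued through the Redshift Data API purely for illustration (a plain JDBC/ODBC connection works just as well):
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.RedshiftDataAPIService;
using Amazon.RedshiftDataAPIService.Model;
using Amazon.S3;

class ManifestCopyLoad
{
    static async Task Main()
    {
        // Keys of the new *.gz files to load this run (placeholder values).
        var newKeys = new List<string> { "exports/2016-01-01/part-000.gz", "exports/2016-01-01/part-001.gz" };
        const string bucket = "my-load-bucket";                          // placeholder
        const string manifestKey = "manifests/load-2016-01-01.json";     // placeholder

        // Step 4/5: list the file names in a manifest and upload it, so the COPY is one unit of load.
        var manifest = new
        {
            entries = newKeys.Select(k => new { url = $"s3://{bucket}/{k}", mandatory = true })
        };
        var s3 = new AmazonS3Client();
        await s3.PutObjectAsync(new Amazon.S3.Model.PutObjectRequest
        {
            BucketName = bucket,
            Key = manifestKey,
            ContentBody = JsonSerializer.Serialize(manifest)
        });

        // Step 7: issue the COPY against the manifest. All identifiers below are placeholders.
        var redshift = new AmazonRedshiftDataAPIServiceClient();
        await redshift.ExecuteStatementAsync(new ExecuteStatementRequest
        {
            ClusterIdentifier = "my-cluster",
            Database = "analytics",
            DbUser = "loader",
            Sql = $"copy public.events from 's3://{bucket}/{manifestKey}' " +
                  "iam_role 'arn:aws:iam::123456789012:role/RedshiftCopyRole' " +
                  "gzip delimiter '|' manifest;"
        });
    }
}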
In general, comparing already-loaded files against what currently exists on S3 is a workable but poor practice. The common "industrial" practice is to put a message queue between the data producer and the data consumer that actually loads the data. Take a look at RabbitMQ vs. Amazon SQS, etc.
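For what it's worth, a minimal sketch of that queue pattern with Amazon SQS (the queue URL is a placeholder): the producer announces each new object key, and the loader drains the queue to decide what goes into its next manifest.
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

class QueueDrivenLoad
{
    const string QueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files"; // placeholder

    // Producer side: announce every object written to the bucket.
    public static Task AnnounceAsync(IAmazonSQS sqs, string bucket, string key) =>
        sqs.SendMessageAsync(QueueUrl, $"s3://{bucket}/{key}");

    // Consumer side: drain pending keys and hand them to the manifest/COPY step.
    public static async Task<List<string>> DrainAsync(IAmazonSQS sqs)
    {
        var keys = new List<string>();
        var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
        {
            QueueUrl = QueueUrl,
            MaxNumberOfMessages = 10,
            WaitTimeSeconds = 20           // long polling
        });
        foreach (var message in response.Messages)
        {
            keys.Add(message.Body);
            // In production, delete only after the COPY that uses this key has succeeded.
            await sqs.DeleteMessageAsync(QueueUrl, message.ReceiptHandle);
        }
        return keys;
    }
}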

Moving files >5 GB to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code which are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if a multipart upload is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it does use multipart uploading. Otherwise the current Java code, which is agnostic about S3, either has to write directly to S3 or has to do what it used to do and then use multipart uploading -- in fact, I would think the code would just write directly to S3 and not worry about uploading at all.
Can anyone tell me if Pipelines can use multipart uploading and if not, can you suggest whether the correct approach is to have the program write directly to S3 or to continue to write to local storage and then perhaps have a separate program be invoked within the same Pipeline which will do the multipart uploading?
The answer, based on AWS support, is that files of 5 GB or more indeed can't be uploaded directly to S3 this way. And there is currently no way for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4 GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5 GB limit imposed by S3 for each file-part PUT.
You need to write your own script wrapping AWS CLI or S3cmd (older). This script may be executed as a shell activity.
Writing directly to S3 may be an issue as S3 does not support append operations - unless you can somehow write multiple smaller objects in a folder.
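If you would rather not wrap the CLI, another option in the same spirit is a tiny uploader program invoked by the ShellCommandActivity. For example, with the AWS SDK for .NET, TransferUtility switches to multipart upload automatically for large files (the local path, bucket, and key below are passed in as arguments):
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Transfer;

class CopyBigFileToS3
{
    static async Task Main(string[] args)
    {
        // args[0] = local file produced by the Java job, args[1] = destination bucket, args[2] = key
        var transfer = new TransferUtility(new AmazonS3Client());
        // TransferUtility performs a multipart upload automatically when the file is large.
        await transfer.UploadAsync(args[0], args[1], args[2]);
    }
}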

Generate A Large File Inside S3 with .NET

I would like to generate a big file (several TB) with a special format using my C# logic and persist it to S3. What is the best way to do this? I could launch a node in EC2, write the big file to EBS, and then upload the file from EBS to S3 using the S3 .NET client library.
Can I stream the file content as I am generating it in my code and send it directly to S3 until the generation is done, especially for such a large file, without running into out-of-memory issues? I can see this code helps with a stream, but it sounds like the stream has to be filled up already. I obviously cannot hold that amount of data in memory, and I also don't want to save it to disk as a file first.
PutObjectRequest request = new PutObjectRequest();
request.WithBucketName(BUCKET_NAME);
request.WithKey(S3_KEY);
request.WithInputStream(ms);
s3Client.PutObject(request);
What is my best bet to generate this big file and stream it to S3 as I am generating it?
You can certainly upload any file up to 5 TB; that's the limit. I recommend using the streaming and multipart put operations. Uploading a 1 TB file in one go could easily fail partway through, and you'd have to do it all over, so break it up into parts as you store it. Also be aware that if you need to modify the file, you would need to download it, modify it, and re-upload it. If you plan on modifying the file at all, I recommend trying to split it up into smaller files.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
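To make that concrete, here is a rough part-by-part reshaping of the code in the question, written against a current AWS SDK for .NET (use the Async variants on .NET Core, and the usual Amazon.S3.Model and System.Collections.Generic usings). It reuses the question's s3Client, BUCKET_NAME, and S3_KEY; GenerateNextChunk is a hypothetical stand-in for your C# generation logic, and every part except the last must be at least 5 MiB:
// Only one part is ever held in memory, instead of one giant MemoryStream.
var init = s3Client.InitiateMultipartUpload(new InitiateMultipartUploadRequest
{
    BucketName = BUCKET_NAME,
    Key = S3_KEY
});

var partETags = new List<PartETag>();
int partNumber = 1;
for (byte[] chunk = GenerateNextChunk(); chunk != null; chunk = GenerateNextChunk())  // hypothetical generator
{
    using (var partStream = new MemoryStream(chunk))
    {
        var part = s3Client.UploadPart(new UploadPartRequest
        {
            BucketName = BUCKET_NAME,
            Key = S3_KEY,
            UploadId = init.UploadId,
            PartNumber = partNumber,
            InputStream = partStream,
            PartSize = chunk.Length
        });
        partETags.Add(new PartETag(partNumber, part.ETag));
    }
    partNumber++;
}

s3Client.CompleteMultipartUpload(new CompleteMultipartUploadRequest
{
    BucketName = BUCKET_NAME,
    Key = S3_KEY,
    UploadId = init.UploadId,
    PartETags = partETags
});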