Unexpected end of file when zlib.gunzipping very large file from S3 bucket - amazon-s3

I'm loading a very large file from an AWS S3 bucket, creating a readStream out of it, gunzipping it, and then passing it through Papa.parse. When I say "very large file", I mean it's 245 MB gzipped and 1.1 GB unzipped. Doing this with smaller files has always worked flawlessly, but with this excessively large file it sometimes succeeds, but most often fails.
When it fails, zlib.createGunzip() throws an "end of file" error. (Apparently something it likes to do; I'm finding lots of references to this everywhere, but nothing that fits my case.)
Clearly it can succeed. Sometimes it does. I don't know what causes it to fail. Random memory shortage? Buffering where gunzip wants to read faster than the file can be loaded? I have no idea.
const file = await readS3File(s3options);
return new Promise((resolve, reject) => {
  Papa.parse(file.createReadStream().pipe(zlib.createGunzip()), { ...papaParseOptions });
});
I'm looking for a way to increase my chances of success. Is there a good way to do that? Some way to pipe this through a buffer that will retry loading the file while appeasing gunzip somehow? Am I looking in the wrong direction?
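One way to improve the odds - a sketch only, resting on assumptions the question doesn't confirm - is to separate the download from the decompression: retry the S3 download into a local temporary file, and only gunzip from disk once the full object is there, so gunzip never reads a truncated network stream. readS3File, s3options and papaParseOptions are the names used above; the retry count, temp path, and the Node 15+ stream/promises import are arbitrary choices for illustration.

const fs = require('fs');
const zlib = require('zlib');
const Papa = require('papaparse');
const { pipeline } = require('stream/promises'); // Node 15+

// Download the whole object to disk first, retrying from scratch on failure,
// so the decompressor never sees a half-delivered stream.
async function downloadWithRetry(s3options, destPath, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const file = await readS3File(s3options);
      await pipeline(file.createReadStream(), fs.createWriteStream(destPath));
      return; // the full gzipped object is now on disk
    } catch (err) {
      if (attempt === attempts) throw err;
      console.warn(`download attempt ${attempt} failed, retrying`, err.code);
    }
  }
}

async function parseLargeGzippedFile(s3options, papaParseOptions) {
  const tmpPath = '/tmp/large-file.csv.gz'; // hypothetical location
  await downloadWithRetry(s3options, tmpPath);
  return new Promise((resolve, reject) => {
    Papa.parse(fs.createReadStream(tmpPath).pipe(zlib.createGunzip()), {
      ...papaParseOptions,
      complete: resolve,
      error: reject,
    });
  });
}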

Related

Silverstripe 4 large Files in Uploadfield

When uploading a large file with UploadField I get the error:
"Server responded with an error.
Expected a value of type "Int" but received: 4008021167"
To set the allowed file size I used $upload->getValidator()->setAllowedMaxFileSize(6291456000);
$upload is an UploadField.
Every file larger than 2 GB gets this error; smaller files are uploaded without any error.
Where can I adjust this so that I can upload bigger files?
I remember that there was a 2 GB limit in the past, but I don't know where to adjust it.
Thanks for your answers,
Klaus
The regular file upload limits don't seem to be the issue if you are already at 2 GB. This might be the memory limit of the process itself. I would recommend looking into chunked uploads - they allow you to process larger files.
I know this answer is late, but the problem is rooted in the GraphQL type definition of the File type: its size field is declared as Int, and GraphQL's Int is a signed 32-bit integer, so any file size above 2,147,483,647 bytes (roughly 2 GB) fails validation. I've submitted a pull request to the upstream repository. Here is the sed one-liner to patch it:
sed -i 's/size\: Int/size\: Float/g' vendor/silverstripe/asset-admin/_graphql/types/File.yml

Intercepting File Writes on OS X

I have a program that generates information from the contents of files. However, I believe it would be more efficient if I could do this as the files are being written, rather than having to read the contents back after some delay, since I can simply generate the data as the file is written to disk.
What method(s) are available for an application to hook into the file-write process, i.e. to process the data stream as it's being written to disk? Also, which of these (if any) are allowed for App Store apps?
I've been considering using a Spotlight importer; however, this still involves reading the contents of a file after they've been written, in which case I'm relying on the file still being in the RAM cache to reduce disk access.

How chunk file upload works

I am working on file upload and am really wondering how chunked file upload actually works.
I understand that the client sends the data to the server in small chunks instead of the complete file at once, but I have a few questions:
For the browser to divide the whole file into chunks and send them, will it read the complete file into memory? If yes, then again there is a chance of memory exhaustion and a browser crash for big files (say > 10 GB).
How do cloud applications like Google Drive and Dropbox handle uploads of such big files?
If multiple files are selected for upload and all are larger than 5-10 GB, does the browser keep all the files in memory and then send them chunk by chunk?
Not sure if you're still looking for an answer. I was in your position recently, and here's what I've come up with; hope it helps: Deal chunk uploaded files in php
During the upload, if you print out the request on the backend, you will see three parameters: _chunkNumber, _totalSize and _chunkSize. With these parameters it's easy to decide whether a given chunk is the last piece; if it is, assembling all of the pieces into a whole shouldn't be hard.
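Here is a rough sketch of that server-side assembly in Node/Express rather than the PHP of the linked answer. It relies only on the three fields named above; the route, temp paths, field name and 0-based chunk numbering are assumptions.

const express = require('express');
const multer = require('multer');
const fs = require('fs');

const app = express();
const upload = multer({ dest: '/tmp/chunks' }); // each chunk arrives as its own temp file

app.post('/upload', upload.single('file'), (req, res) => {
  const chunkNumber = Number(req.body._chunkNumber); // which piece this is (assumed 0-based)
  const chunkSize = Number(req.body._chunkSize);     // nominal size of every piece
  const totalSize = Number(req.body._totalSize);     // size of the original file
  const totalChunks = Math.ceil(totalSize / chunkSize);

  // Park the chunk under a predictable name until all pieces have arrived.
  const partPath = `/tmp/chunks/${req.file.originalname}.part${chunkNumber}`;
  fs.renameSync(req.file.path, partPath);

  if (chunkNumber === totalChunks - 1) {
    // Last piece (assuming chunks are sent sequentially): stitch them together in order.
    fs.mkdirSync('/tmp/assembled', { recursive: true });
    const out = fs.createWriteStream(`/tmp/assembled/${req.file.originalname}`);
    for (let i = 0; i < totalChunks; i++) {
      out.write(fs.readFileSync(`/tmp/chunks/${req.file.originalname}.part${i}`));
    }
    out.end();
  }
  res.sendStatus(200);
});

app.listen(3000);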
As for the JavaScript side, ng-file-upload has a setting named "resumeChunkSize" that lets you enable chunked mode and set the chunk size.
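A minimal AngularJS sketch of that setting, assuming the ngFileUpload module and its Upload service; the endpoint URL and the 10MB chunk size are placeholder values.

angular.module('app', ['ngFileUpload'])
  .controller('UploadCtrl', function (Upload) {
    this.send = function (file) {
      Upload.upload({
        url: '/upload',            // hypothetical endpoint
        data: { file: file },
        resumeChunkSize: '10MB'    // enables chunked mode; each request carries one 10 MB slice
      }).then(function (resp) {
        console.log('upload finished', resp.status);
      });
    };
  });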

Azure Blob Storage: CloudBlockBlob.DownloadToStream throws OutOfMemory exception

This happens on my Azure Storage Emulator (I have not tried real Azure Storage yet). I'm saving files to Blob Storage. I don't have any problem with smaller files (e.g. <= 107 MB). However, for bigger files (e.g. >= 114 MB), I can upload the file without error, but I get an out-of-memory exception when trying to download it.
public Stream GetStream(string fileName)
{
    var blob = GetCloudBlobContainer().GetBlockBlobReference(fileName);
    if (blob.Exists())
    {
        Stream stream = new MemoryStream();
        blob.DownloadToStream(stream);
        return stream;
    }
    return null;
}
The exception is thrown on the call to blob.DownloadToStream(stream).
How can I fix this problem?
UPDATE:
Okay, I found a workaround for my case. Instead of returning a stream, I can save directly to a local file (which I need anyway) using blob.DownloadToFile(), and that works fine. However, I'm still interested in finding a solution to this problem.
MemoryStream stores all of your data in memory, and the fact that DownloadToFile works for you means that your machine is likely running out of memory when it tries to hold the blob in memory.
As for uploads: if you upload directly from a file on your file system to a blob, we do not load the whole file into memory, so you will not hit the same problem as on download.
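A hedged C# sketch of the streaming alternative: instead of copying the whole blob into a MemoryStream, return the stream from CloudBlockBlob.OpenRead(), which fetches data on demand; GetCloudBlobContainer() is the question's own helper, and the caller becomes responsible for disposing the returned stream.

public Stream GetStream(string fileName)
{
    var blob = GetCloudBlobContainer().GetBlockBlobReference(fileName);
    if (!blob.Exists())
    {
        return null;
    }
    // OpenRead() reads ranges of the blob lazily instead of buffering it all,
    // so memory use stays bounded regardless of blob size.
    return blob.OpenRead();
}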
In addition to Vinay's answer above, I would suggest keeping the "Performance" and "Processes" tabs of Windows Task Manager open to monitor memory usage while downloading.

Generate A Large File Inside s3 with .NET

I would like to generate a big file (several TB) in a special format using my C# logic and persist it to S3. What is the best way to do this? I can launch a node in EC2, write the big file to EBS, and then upload the file from EBS into S3 using the S3 .NET client library.
Can I stream the file content as I generate it in my code and send it directly to S3 until generation is done, especially for such a large file, to avoid out-of-memory issues? I can see that this code helps with streams, but it sounds like the stream must already have been filled. I obviously cannot hold that amount of data in memory, and I also do not want to save it as a file to disk first.
// "ms" here is a stream (e.g. a MemoryStream) that already holds the full payload,
// which is exactly what won't work for a multi-terabyte file.
PutObjectRequest request = new PutObjectRequest();
request.WithBucketName(BUCKET_NAME);
request.WithKey(S3_KEY);
request.WithInputStream(ms);
s3Client.PutObject(request);
What is my best bet to generate this big file and stream it to S3 as I am generating it?
You can certainly upload any file up to 5 TB; that's the limit. I recommend using the streaming and multipart PUT operations. Uploading a 1 TB file could easily fail partway through and you'd have to start all over, so break it into parts as you store it. Also be aware that if you need to modify the file, you would have to download it, modify it, and re-upload it. If you plan on modifying the file at all, I recommend trying to split it into smaller files.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
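A hedged sketch of that multipart approach with the AWS SDK for .NET, written in the newer property style rather than the WithXxx style shown above. BUCKET_NAME, S3_KEY and s3Client are from the question, while GenerateChunks() is a hypothetical stand-in for "my C# logic" that yields the output piece by piece, so only one piece is ever held in memory.

using System.Collections.Generic;
using System.IO;
using Amazon.S3;
using Amazon.S3.Model;

var init = s3Client.InitiateMultipartUpload(new InitiateMultipartUploadRequest
{
    BucketName = BUCKET_NAME,
    Key = S3_KEY
});

var partResponses = new List<UploadPartResponse>();
int partNumber = 1; // S3 part numbers start at 1

// 512 MB parts keep a 5 TB object within S3's 10,000-part limit
// (each part must be between 5 MB and 5 GB, except the last one).
foreach (byte[] chunk in GenerateChunks(512 * 1024 * 1024))
{
    using (var partStream = new MemoryStream(chunk))
    {
        partResponses.Add(s3Client.UploadPart(new UploadPartRequest
        {
            BucketName = BUCKET_NAME,
            Key = S3_KEY,
            UploadId = init.UploadId,
            PartNumber = partNumber++,
            InputStream = partStream,
            PartSize = chunk.Length
        }));
    }
}

var complete = new CompleteMultipartUploadRequest
{
    BucketName = BUCKET_NAME,
    Key = S3_KEY,
    UploadId = init.UploadId
};
complete.AddPartETags(partResponses);
s3Client.CompleteMultipartUpload(complete);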