Azure Blob Storage: CloudBlockBlob.DownloadToStream throws OutOfMemory exception - azure-storage

This happens on my Azure Storage Emulator (I haven't tried real Azure Storage yet). I'm saving files to Blob Storage. I don't have any problem with smaller files (e.g. <= 107 MB). However, for bigger files (e.g. >= 114 MB), I can upload the file without error, but I get an out-of-memory exception when trying to download it.
public Stream GetStream(string fileName)
{
    var blob = GetCloudBlobContainer().GetBlockBlobReference(fileName);
    if (blob.Exists())
    {
        Stream stream = new MemoryStream();
        blob.DownloadToStream(stream);
        return stream;
    }
    return null;
}
The exception is thrown on the call to blob.DownloadToStream(stream).
How can I fix this problem?
UPDATE:
Okay, I found a workaround for my case. Instead of returning a stream, I can save the blob directly to a local file (which I need anyway) using blob.DownloadToFile(), and that works fine. However, I'm still interested in finding a solution to this problem.
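For reference, the workaround looks roughly like this (a sketch assuming the classic Microsoft.WindowsAzure.Storage client; the local path parameter is only a placeholder):
public void SaveToFile(string fileName, string localPath)
{
    var blob = GetCloudBlobContainer().GetBlockBlobReference(fileName);
    // Streams the blob straight to disk instead of buffering it all in memory.
    // FileMode comes from System.IO.
    blob.DownloadToFile(localPath, FileMode.Create);
}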

MemoryStream stores all of your data in memory, so the fact that DownloadToFile works for you suggests that your machine is running out of memory when it tries to hold the whole blob in memory.
As for uploads: if you upload directly from a file on your file system to a blob, we do not load the whole file into memory, so you will not hit the same problem as on download.
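One way to avoid buffering the whole blob at all (a sketch, assuming the same storage client library and the GetCloudBlobContainer helper from the question) is to return the stream from CloudBlockBlob.OpenRead(), which fetches data from the service as the caller reads it:
public Stream GetStream(string fileName)
{
    var blob = GetCloudBlobContainer().GetBlockBlobReference(fileName);
    // OpenRead does not download the blob up front; data is pulled from
    // the service in chunks as the returned stream is read.
    return blob.Exists() ? blob.OpenRead() : null;
}
The caller is then responsible for disposing the stream once it has finished reading.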

In addition to Vinay's answer above, I would suggest keeping the "Performance" and "Processes" tabs of Windows Task Manager open to monitor memory usage while downloading.

Related

unexpected end of file when zlib.gunzipping very large file from s3 bucket

I'm loading a very large file from an AWS s3 bucket, creating a readStream out of it, gunzipping it, and then passing it through Papa.parse. When I say "very large file", I mean it's 245 MB gzipped and 1.1 GB unzipped. Doing this with smaller files has always worked flawlessly, but with this excessively large file it sometimes succeeds, but most often fails.
When it fails, zlib.createGunzip() throws an "end of file" error. (Apparently something it likes to do; I'm finding lots of references to this everywhere, but nothing that fits my case.)
Clearly it can succeed. Sometimes it does. I don't know what causes it to fail. Random memory shortage? Buffering where gunzip wants to read faster than the file can be loaded? I have no idea.
const file = await readS3File(s3options);
return new Promise((resolve, reject) => {
    // Gunzip the S3 read stream on the fly and hand the result to Papa.parse.
    Papa.parse(file.createReadStream().pipe(zlib.createGunzip()), { ...papaParseOptions });
});
I'm looking for a way to increase my chances of success. Is there a good way to do that? Some way to pipe this through a buffer that will retry loading the file while appeasing gunzip somehow? Am I looking in the wrong direction?

MaxMind: loading GeoIP2 mmdb into memory for fast read

I am using MaxMind's GeoIP2 to get the geo information for an IP address. In my Java web application, the DatabaseReader is created like this:
DatabaseReader reader = new DatabaseReader.Builder(new File("C:\\GeoLite2-City.mmdb")).withCache(new CHMCache()).build();
I am hoping to load the entire file into memory for efficient, fast reads.
Is the way shown above the most efficient/fast way of using the mmdb database?
The code you pasted will memory-map the file and use the data cache. It should be efficient, but it will not load the whole database into memory. If you want to do that, you would need to load the database using the fileMode builder option, e.g.:
DatabaseReader reader = new DatabaseReader
    .Builder(new File("C:\\GeoLite2-City.mmdb"))
    .fileMode(com.maxmind.db.Reader.FileMode.MEMORY)
    .withCache(new CHMCache())
    .build();
However, in most cases, you will probably not see a performance difference between this and the memory-mapped file.

Intercepting File Writes on OS X

I have a program that generates information from the contents of files. However, I believe it would be more efficient to do this as the files are being written, rather than having to read the contents back after some delay, since I can simply generate the data as the file is written to disk.
What method(s) are available for an application to hook into the file-write process, i.e. to process the data stream as it's being written to disk? Also, which of these (if any) are allowable for App Store apps?
I've been considering using a Spotlight Importer; however, this still involves reading the contents of a file after it has been written, in which case I'm relying on the file still being in the RAM cache to reduce disk access.

PDFNet SDK Convert files on Azure storage

I have a web app that needs to convert PDFs to XODs (PDFTron’s format to display documents in their WebViewer). My web app is hosted on Azure, and the PDFs are on Azure Storage. We would like to go with the on-premises conversion via the PDFNet SDK (http://www.pdftron.com/webviewer/getstarted.html, see “Choosing a deployment model for conversions”); my code so far is the following:
WebRequest req = HttpWebRequest.Create("url of PDF on Azure Storage here");
using (Stream stream = req.GetResponse().GetResponseStream())
{
    PDFDoc pdfdoc = new PDFDoc(stream);
    var converted = pdftron.PDF.Convert.ToXod(pdfdoc);
    //pdfdoc.Save(stream, pdftron.SDF.SDFDoc.SaveOptions.e_linearized); //not clear to me
}
My approach here is to create a stream from the file on Azure Storage and convert that stream to XOD. I still don’t know if I should call “Save” and, in that case, where the file would be saved.
My questions are:
Since everything runs in the cloud, does it make sense to use the CloudAPI instead of the self-hosted solution, or does it not make any difference?
In both cases, where is the converted file stored (since I am getting it from Azure Storage and not from a local server)? I would then need to move it to my Azure Storage account. Does the file get saved locally (meaning on the web/worker role which is processing it) and therefore need to be moved to storage?
Here (http://www.pdftron.com/pdfnet/samplecode.html) there are conversion code samples but they all use files on local machine, which would not be my case.
Since everything runs in the cloud, does it make sense to use the CloudAPI instead of the self-hosted solution, or does it not make any difference?
In both cases, where is the converted file stored [...]
If you were to go with the cloud solution, you would transfer your files to PDFTron's servers, where they will be converted. Then you would download the converted files.
If you were to go with the on-premises solution, you would need to run DocPub CLI (https://www.pdftron.com/docpub/downloads.html) on your Azure instance, and its only communication with PDFTron would be to increment the billing counter for your PWS account (https://www.pdftron.com/pws/index.html).
You'd have to decide for yourself which solution works best for you.
Here (http://www.pdftron.com/pdfnet/samplecode.html) there are conversion code samples but they all use files on local machine, which would not be my case.
[Note: these samples show how to use the PDFNet SDK to run conversions. To run PDFNet you would need an additional license. So you probably want to use DocPub CLI or the cloud converter instead.]
The samples show how to convert the files locally, since XOD conversion would need to be run server-side. How most people do so is by setting up some web service to upload PDF (or other format) files. Then they convert the documents server-side, and place the converted XOD files someplace where the WebViewer can serve them.
After some extra research I found out that I can open read and write streams for the source and destination blobs (even if the destination blob does not exist yet) directly on Azure, without downloading the file first. The resulting code is then something like:
using (var sourceStream = sourceBlob.OpenRead())
{
    var destinationContainer = BlobClient.GetContainerReference(projectKey);
    var destinationBlob = destinationContainer.GetBlockBlobReference(xodName);
    using (var destinationStream = destinationBlob.OpenWrite())
    {
        var pdfDoc = new PDFDoc(sourceStream);
        pdftron.PDF.Convert.ToXod(pdfDoc);
        pdfDoc.Save(destinationStream, pdftron.SDF.SDFDoc.SaveOptions.e_linearized);
    }
}

Generate A Large File Inside s3 with .NET

I would like to generate a big file (several TB) with a special format using my C# logic and persist it to S3. What is the best way to do this? I can launch a node in EC2, write the big file into EBS, and then upload the file from EBS into S3 using the S3 .NET client library.
Can I stream the file content as I am generating it in my code and stream it directly to S3 until the generation is done, especially for such a large file and to avoid out-of-memory issues? I can see that the code below helps with streams, but it sounds like the stream needs to already be filled up. I obviously cannot put such an amount of data into memory, and I also do not want to save it as a file to disk first.
PutObjectRequest request = new PutObjectRequest();
request.WithBucketName(BUCKET_NAME);
request.WithKey(S3_KEY);
request.WithInputStream(ms);
s3Client.PutObject(request);
What is my best bet to generate this big file and stream it to S3 as I am generating it?
You can certainly upload any file up to 5 TB; that's the limit. I recommend using the streaming and multipart put operations. Uploading a 1 TB file could easily fail partway through and you'd have to do it all over, so break it up into parts as you store it. Also be aware that if you need to modify the file, you would have to download it, modify it, and re-upload it. If you plan on modifying the file at all, I recommend trying to split it up into smaller files.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
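As a sketch of the multipart approach (using the low-level multipart API of the AWS SDK for .NET with property-style initializers rather than the older WithXxx helpers shown above; chunkGenerator is a hypothetical placeholder for your own generation logic, and newer SDK targets expose only the *Async variants of these calls), the idea is to upload each part as soon as it is generated, so only one part ever sits in memory:
using System.Collections.Generic;
using System.IO;
using Amazon.S3;
using Amazon.S3.Model;

public static void GenerateAndUpload(IAmazonS3 s3Client, string bucket, string key,
                                     IEnumerable<byte[]> chunkGenerator)
{
    // Start the multipart upload and keep its id for the subsequent calls.
    var init = s3Client.InitiateMultipartUpload(new InitiateMultipartUploadRequest
    {
        BucketName = bucket,
        Key = key
    });

    var partETags = new List<PartETag>();
    int partNumber = 1;

    // chunkGenerator yields buffers of a few hundred MB each (minimum part size is
    // 5 MB except for the last part; at most 10,000 parts per object, so a
    // multi-TB object needs parts of roughly 500 MB or more).
    foreach (var chunk in chunkGenerator)
    {
        using (var ms = new MemoryStream(chunk))
        {
            var partResponse = s3Client.UploadPart(new UploadPartRequest
            {
                BucketName = bucket,
                Key = key,
                UploadId = init.UploadId,
                PartNumber = partNumber,
                InputStream = ms,
                PartSize = chunk.Length
            });
            partETags.Add(new PartETag(partNumber, partResponse.ETag));
        }
        partNumber++;
    }

    // Tell S3 to assemble the uploaded parts into the final object.
    s3Client.CompleteMultipartUpload(new CompleteMultipartUploadRequest
    {
        BucketName = bucket,
        Key = key,
        UploadId = init.UploadId,
        PartETags = partETags
    });
}
If a single part fails, only that part needs to be retried, which also addresses the concern about a huge upload failing partway through.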