I need to create a service that returns a GZipStream consisting of one or more files. The number of files could be in the hundreds, and each file could potentially be more than 500 MB.
Is it somehow possible to add the files to the GZipStream dynamically as the stream is being transferred (to avoid running into an out-of-memory exception when the files are copied into the stream)?
For example:
Copy fileA to the stream being returned.
The client starts reading the stream.
When fileA has been read (client side), copy fileB to the stream (server side).
The client continues to read the stream.
... and so on until there's no more files.
Btw. it's not important that the files are compressed, just that they are combined into a zip file so that the client only has to download one single file.
So my goal is: stream multiple files back to the client as one single file, without processing all the files at once on the server (to avoid loading all files into memory and raising an out-of-memory exception).
Could this be done by creating a custom stream somehow or is there an easier way to go?
Thanks.
You could combine the files to a single zip file on disk and then stream that file back.
For how to combine the files into a zip file, see: c# sharpziplib adding file to existing archive
This solves the out of memory problem, but it does mean that you need a lot of disk space.
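If it helps, here is a rough sketch of that approach using SharpZipLib's ZipOutputStream, writing each file into the archive one at a time so only a small buffer is ever in memory. The method and path names are placeholders, and SetLevel(0) stores the entries without compression, which matches what you said you need.

using System.Collections.Generic;
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

public static class ZipCombiner
{
    // Combine many large files into one zip on disk, one file at a time.
    public static void CombineToZip(IEnumerable<string> sourceFiles, string zipPath)
    {
        using (var zipStream = new ZipOutputStream(File.Create(zipPath)))
        {
            zipStream.SetLevel(0); // 0 = store only, no compression

            foreach (var file in sourceFiles)
            {
                zipStream.PutNextEntry(new ZipEntry(Path.GetFileName(file)));
                using (var input = File.OpenRead(file))
                {
                    input.CopyTo(zipStream); // copies in chunks, never the whole file
                }
                zipStream.CloseEntry();
            }
        }
    }
}

The finished zip can then be streamed back with an ordinary FileStream and deleted once the download completes.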
I am using Azure Data Lake Store for file storage. I am using operations like:
Creating a main file
Creating part files
Appending these part files to the main file (using concurrent append)
Example:
There is a main log file (which will eventually contain the logs from all programs)
There are part log files that each program creates on its own and then appends to the main log file
The workflow runs fine, but I have noticed some unknown files being uploaded to the store directory. These files are named with a GUID, have no extension, and are empty.
Does anyone know what might be the reason for these extra files?
Thanks for reformatting your question. This looks like processing artefacts that will probably disappear shortly afterwards. How did you upload/create your files?
I have a program that generates information from the contents of files; however, I believe it would be more efficient if I could do this as the files are being written, rather than having to read the contents back after some delay, since I can simply generate the data while the file is being written to disk.
What method(s) are available for an application to hook into the file-write process, i.e. to process the data stream as it's being written to disk? Also, which of these (if any) are allowable for App Store apps?
I've been considering using a Spotlight Importer; however, this still involves reading the contents of a file after they've been written, in which case I'm relying on the file still being in the RAM cache to reduce disk access.
I would like to generate a big file (several TB) in a special format using my C# logic and persist it to S3. What is the best way to do this? I could launch a node in EC2, write the big file to EBS, and then upload it from EBS to S3 using the S3 .NET client library.
Can I stream the file content as I am generating it in my code and send it directly to S3 until the generation is done, especially for such a large file and to avoid out-of-memory issues? I can see this code helps with a stream, but it sounds like the stream should already have been filled up. I obviously cannot hold such an amount of data in memory, and I also do not want to save it to disk as a file first.
// Fluent-style AWS SDK for .NET call; 'ms' is a stream that already holds the whole object,
// which is exactly the problem for a multi-TB file.
PutObjectRequest request = new PutObjectRequest();
request.WithBucketName(BUCKET_NAME);
request.WithKey(S3_KEY);
request.WithInputStream(ms);
s3Client.PutObject(request);
What is my best bet to generate this big file and stream it to S3 as I am generating it?
You can certainly upload any file up to 5 TB; that's the limit. I recommend using the streaming and multipart put operations. Uploading a 1 TB file could easily fail partway through and you'd have to do it all over again, so break it up into parts when you're storing it. Also, be aware that if you need to modify the file you would have to download it, modify it, and re-upload it. If you plan on modifying the file at all, I recommend trying to split it up into smaller files.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
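For what it's worth, a rough sketch of that idea with the AWS SDK for .NET's multipart API (property-style requests rather than the With... calls above). The chunk source, bucket, and key are placeholders, and the abort/retry handling you would want for a multi-TB upload is omitted.

using System.Collections.Generic;
using System.IO;
using Amazon.S3;
using Amazon.S3.Model;

public static class BigFileUploader
{
    // Generate the data in chunks and push each chunk as one part of a
    // multipart upload, so only one part is ever held in memory.
    // Every part except the last must be at least 5 MB.
    public static void GenerateAndUpload(IAmazonS3 s3Client, string bucket, string key,
                                         IEnumerable<byte[]> chunks)
    {
        var init = s3Client.InitiateMultipartUpload(
            new InitiateMultipartUploadRequest { BucketName = bucket, Key = key });

        var partResponses = new List<UploadPartResponse>();
        int partNumber = 1;

        foreach (var chunk in chunks)
        {
            using (var partStream = new MemoryStream(chunk))
            {
                partResponses.Add(s3Client.UploadPart(new UploadPartRequest
                {
                    BucketName = bucket,
                    Key = key,
                    UploadId = init.UploadId,
                    PartNumber = partNumber++,
                    InputStream = partStream
                }));
            }
        }

        var complete = new CompleteMultipartUploadRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = init.UploadId
        };
        complete.AddPartETags(partResponses); // records the ETag of every uploaded part
        s3Client.CompleteMultipartUpload(complete);
    }
}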
I am wondering about the best way to achieve de-duplicated (single-instance storage) file storage within Amazon S3. For example, if I have 3 identical files, I would like to store the file only once. Is there a library, API, or program out there to help implement this? Is this functionality present in S3 natively? Perhaps something that checks the file hash, etc.
I'm wondering what approaches people have used to accomplish this.
You could probably roll your own solution to do this. Something along the lines of:
To upload a file:
Hash the file first, using SHA-1 or stronger.
Use the hash to name the file. Do not use the actual file name.
Create a virtual file system of sorts to save the directory structure - each file can simply be a text file that contains the calculated hash. This 'file system' should be placed separately from the data blob storage to prevent name conflicts - like in a separate bucket.
To upload subsequent files:
Calculate the hash, and only upload the data blob file if it doesn't already exist.
Save the directory entry with the hash as the content, like for all files.
To read a file:
Open the file from the virtual file system to discover the hash, and then get the actual file using that information.
You could also make this technique more efficient by uploading files in fixed-size blocks - and de-duplicating, as above, at the block level rather than the full-file level. Each file in the virtual file system would then contain one or more hashes, representing the block chain for that file. That would also have the advantage that uploading a large file which is only slightly different from another previously uploaded file would involve a lot less storage and data transfer.
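If you do roll your own, a rough per-file sketch with the AWS SDK for .NET might look like the following; the bucket names are placeholders, the "directory entry" is just a tiny text object holding the hash, and real code would want retries and error handling.

using System;
using System.IO;
using System.Net;
using System.Security.Cryptography;
using Amazon.S3;
using Amazon.S3.Model;

public static class DedupStore
{
    const string BlobBucket = "my-blobs";   // content-addressed data (placeholder name)
    const string IndexBucket = "my-index";  // virtual file system entries (placeholder name)

    public static void Upload(IAmazonS3 s3, string localPath, string virtualPath)
    {
        // 1. Hash the file and use the hash as the blob's key.
        string hash;
        using (var sha1 = SHA1.Create())
        using (var file = File.OpenRead(localPath))
            hash = BitConverter.ToString(sha1.ComputeHash(file)).Replace("-", "").ToLowerInvariant();

        // 2. Only upload the data blob if it doesn't already exist.
        if (!BlobExists(s3, hash))
            s3.PutObject(new PutObjectRequest { BucketName = BlobBucket, Key = hash, FilePath = localPath });

        // 3. The directory entry just records which blob the file points at.
        s3.PutObject(new PutObjectRequest { BucketName = IndexBucket, Key = virtualPath, ContentBody = hash });
    }

    public static Stream Open(IAmazonS3 s3, string virtualPath)
    {
        // Read the hash from the directory entry, then fetch the actual blob.
        string hash;
        using (var entry = s3.GetObject(IndexBucket, virtualPath))
        using (var reader = new StreamReader(entry.ResponseStream))
            hash = reader.ReadToEnd().Trim();

        return s3.GetObject(BlobBucket, hash).ResponseStream;
    }

    static bool BlobExists(IAmazonS3 s3, string key)
    {
        try
        {
            s3.GetObjectMetadata(BlobBucket, key);
            return true;
        }
        catch (AmazonS3Exception e) when (e.StatusCode == HttpStatusCode.NotFound)
        {
            return false;
        }
    }
}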
Is it possible to use DotNetZip to create a zip from a file that is in use (e.g. a log file from another application)?
So, create a zip while the log file is being written by the other application.
Hmm, well, yes, if you are willing to write some code.
One way to do it is to compress the file AFTER it has been written and closed.
You would need to have an app that runs with a filesystem watcher, and when it sees the log file being closed, it compresses that log file into a zip.
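A rough sketch of that watcher approach is below. FileSystemWatcher has no real "file closed" event, so this treats a successful exclusive open as the sign that the writer has finished; the paths are placeholders.

using System;
using System.IO;
using Ionic.Zip; // DotNetZip

class LogZipper
{
    static void Main()
    {
        var watcher = new FileSystemWatcher(@"C:\logs", "*.log");
        watcher.Changed += (s, e) => TryZip(e.FullPath);
        watcher.EnableRaisingEvents = true;
        Console.ReadLine(); // keep the watcher alive
    }

    static void TryZip(string path)
    {
        try
        {
            // Fails while the other application still has the file open for writing.
            using (File.Open(path, FileMode.Open, FileAccess.Read, FileShare.None)) { }
        }
        catch (IOException)
        {
            return; // still being written; a later Changed event will retry
        }

        using (var zip = new ZipFile())
        {
            zip.AddFile(path, "");                        // "" = store at the archive root
            zip.Save(Path.ChangeExtension(path, ".zip")); // e.g. app.log -> app.zip
        }
    }
}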
If you mean a distinct app that writes to a file and the data automagically ends up in a zip file, no, I don't know of a simple way to do that. There is one possibility: if the 3rd-party app accepts a System.IO.Stream to write the log entries into, then you can do it with DotNetZip. You can get a writeable stream from DotNetZip into which the app writes content; it is compressed as it is written, and when the writing is complete, DotNetZip closes the zip file. To use this, check the ZipFile.AddEntry() method that accepts a WriteDelegate. It's in the documentation.
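And for that stream case, a minimal sketch of the ZipFile.AddEntry() + WriteDelegate idea, assuming the 3rd-party code can be pointed at a Stream you hand it (WriteLogEntries here is a stand-in for that code):

using System.IO;
using System.Text;
using Ionic.Zip; // DotNetZip

class DelegateZipExample
{
    static void Main()
    {
        using (var zip = new ZipFile())
        {
            // The entry's content is produced by a delegate that receives a
            // writeable stream; whatever is written into it ends up in the archive.
            zip.AddEntry("app.log", (entryName, stream) =>
            {
                // Hand "stream" to the logging code instead of a FileStream.
                WriteLogEntries(stream);
            });
            zip.Save("app-log.zip"); // the entry is compressed as it is written out
        }
    }

    static void WriteLogEntries(Stream stream)
    {
        // Placeholder for the other application's log writing.
        var bytes = Encoding.UTF8.GetBytes("log line 1\nlog line 2\n");
        stream.Write(bytes, 0, bytes.Length);
    }
}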