PDFNet SDK: Convert files on Azure Storage

I have a web app that needs to convert PDFs to XODs (PDFTron's format for displaying documents in their WebViewer). My web app is hosted on Azure and the PDFs are on Azure Storage. We would like to go with the on-premises conversion via the PDFNet SDK (http://www.pdftron.com/webviewer/getstarted.html, see "Choosing a deployment model for conversions"). My code so far is the following:
WebRequest req = HttpWebRequest.Create("url of PDF on Azure Storage here");
using (Stream stream = req.GetResponse().GetResponseStream())
{
    PDFDoc pdfdoc = new PDFDoc(stream);
    var converted = pdftron.PDF.Convert.ToXod(pdfdoc);
    //pdfdoc.Save(stream, pdftron.SDF.SDFDoc.SaveOptions.e_linearized); //not clear to me
}
My approach here is to create a stream from the file on Azure Storage and convert that stream to XOD. I still don't know whether I should call "Save", and if so, where the file would be saved.
My questions are:
Since everything runs in the cloud, does it make sense to use the Cloud API instead of the self-hosted solution, or does it not make any difference?
In both cases, where is the converted file stored? Since I am getting the source from Azure Storage and not from a local server, does the converted file get saved locally (i.e. on the web/worker role processing it), meaning it then needs to be moved to my Azure Storage account?
Here (http://www.pdftron.com/pdfnet/samplecode.html) there are conversion code samples, but they all use files on the local machine, which would not be my case.

Since everything runs in the cloud, does it make sense to use the Cloud API instead of the self-hosted solution, or does it not make any difference?
In both cases, where is the converted file stored [...]
If you were to go with the cloud solution, you would transfer your files to PDFTron's servers, where they will be converted. Then you would download the converted files.
If you were to go with the on-premises solution, you would need to run DocPub CLI (https://www.pdftron.com/docpub/downloads.html) on your Azure instance, and its only communication with PDFTron would be to increment the billing counter for your PWS account (https://www.pdftron.com/pws/index.html).
You'd have to decide for yourself which solution works best for you.
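If DocPub is the route you take, the conversion step on the web/worker role is essentially just a process invocation. Here is a minimal sketch; the executable path is hypothetical and the "-f xod"/"-o" flags are assumptions that should be verified against DocPub's own help output before use.
using System.Diagnostics;

public static class DocPubRunner
{
    // Runs DocPub against a locally downloaded PDF and writes the XOD into outputDir.
    // The executable path and the flags below are assumptions; check DocPub's help
    // output for the real ones.
    public static int ConvertToXod(string inputPdf, string outputDir)
    {
        var psi = new ProcessStartInfo
        {
            FileName = @"C:\tools\docpub\docpub.exe",   // hypothetical install location
            Arguments = $"-f xod -o \"{outputDir}\" \"{inputPdf}\"",
            UseShellExecute = false
        };

        using (var proc = Process.Start(psi))
        {
            proc.WaitForExit();
            return proc.ExitCode;   // a non-zero exit code means the conversion failed
        }
    }
}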
Here (http://www.pdftron.com/pdfnet/samplecode.html) there are conversion code samples but they all use files on the local machine, which would not be my case.
[Note: these samples show how to use the PDFNet SDK to run conversions. To run PDFNet you would need an additional license. So you probably want to use DocPub CLI or the cloud converter instead.]
The samples show how to convert the files locally, since XOD conversion needs to be run server-side. Most people do this by setting up a web service to upload PDF (or other format) files, converting the documents server-side, and then placing the converted XOD files somewhere the WebViewer can serve them.
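As a rough sketch of that flow (download, convert, upload), assuming the classic Microsoft.WindowsAzure.Storage client and the separately licensed PDFNet SDK; container and blob names are placeholders:
using System.IO;
using Microsoft.WindowsAzure.Storage.Blob;

public static class XodConversionJob
{
    // Downloads a PDF blob to local temp storage, converts it to XOD with PDFNet,
    // and uploads the result back to the same container. Names are placeholders.
    public static void ConvertBlob(CloudBlobContainer container, string pdfName)
    {
        string localPdf = Path.Combine(Path.GetTempPath(), pdfName);
        string localXod = Path.ChangeExtension(localPdf, ".xod");

        CloudBlockBlob source = container.GetBlockBlobReference(pdfName);
        source.DownloadToFile(localPdf, FileMode.Create);

        pdftron.PDFNet.Initialize();                    // Initialize(licenseKey) if you have one
        pdftron.PDF.Convert.ToXod(localPdf, localXod);  // file-to-file conversion

        CloudBlockBlob target = container.GetBlockBlobReference(Path.GetFileName(localXod));
        using (var xodStream = File.OpenRead(localXod))
        {
            target.UploadFromStream(xodStream);
        }
    }
}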

After some extra research I found out that I can get and set the streams of the source and destination files (even if the destination file does not exist yet) directly on Azure, without downloading the file. The resulting code is then something like:
using (var sourceStream = sourceBlob.OpenRead())
{
    var destinationContainer = BlobClient.GetContainerReference(projectKey);
    var destinationBlob = destinationContainer.GetBlockBlobReference(xodName);
    using (var destinationStream = destinationBlob.OpenWrite())
    {
        var pdfDoc = new PDFDoc(sourceStream);
        pdftron.PDF.Convert.ToXod(pdfDoc);
        pdfDoc.Save(destinationStream, pdftron.SDF.SDFDoc.SaveOptions.e_linearized);
    }
}

Related

Exporting data over an API from S3 using Lambda

I have some data stored in DynamoDB and some high-res images of each user stored in S3. The requirement is to be able to export a user's data on demand: via an API endpoint, collate all the data and send it as a response. We are using AWS Lambda with Node.js for business logic, S3 for storing images, and a SQL DB for storing relational data.
I had set up a mechanism where API Gateway receives requests and puts them in an SQS queue. The queue triggers a Lambda which runs queries to gather all the data and image paths. We copy all the images and data into a new bucket with the custId as the folder name. Now here's where I'm stuck: how to stream this data from our new AWS bucket? All the collected data is about 4 GB. I have tried to stream via AWS Lambda but keep failing; I am able to stream single files but not all the data as a zip. I've done this in Node, but would rather not set up an EC2 instance if possible and solve it directly with S3 and Lambdas.
I can't seem to find a way to stream an entire folder from AWS to the client as a response to an HTTP request.
Okay, found the answer. Instead of trying to return a zip stream, I'm now just zipping and saving the folder on the bucket itself and returning a signed URL for it. Many Node modules help zip S3 folders without loading entire files into memory. Using one of those, we zipped our folder and returned a signed URL. How it behaves under actual load remains to be seen; will find out soon.
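For the last step, handing back a pre-signed URL is a single SDK call. The poster is on Node.js, but the idea is the same in any SDK; here is a sketch with the AWS SDK for .NET, with the bucket and key as placeholders:
using System;
using Amazon.S3;
using Amazon.S3.Model;

public static class ZipLinkBuilder
{
    // Returns a time-limited download link for the already-zipped export object.
    // Bucket and key are placeholders.
    public static string GetDownloadUrl(IAmazonS3 s3, string bucket, string zipKey)
    {
        var request = new GetPreSignedUrlRequest
        {
            BucketName = bucket,
            Key = zipKey,
            Verb = HttpVerb.GET,
            Expires = DateTime.UtcNow.AddMinutes(15)
        };
        return s3.GetPreSignedURL(request);
    }
}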

How to read encrypted and gzipped blob data from U-SQL

I would like to read a file from a blob that is first compressed (gz) and then encrypted. The encryption is done using the Azure SDK when the file is uploaded to the blob (a BlobEncryptionPolicy is passed to CloudBlockBlob.UploadFromStreamAsync).
The blob file has a .gz extension, so U-SQL tries to decompress it but fails because the file is encrypted.
Is it possible to set up my U-SQL script to handle the decryption automatically, the same way it is done by the Azure SDK (for instance by CloudBlockBlob.BeginDownloadToStream)?
If not, and I need to use a custom extractor, how can I prevent U-SQL from trying to decompress the stream automatically?
The decompression is automatically triggered by the ".gz" extension, so you would have to rename the document. Also, please note that you cannot call any external resource to decrypt from within your user code; you will have to pass all keys as parameters to the custom extractor.
Finally, if you store the data in ADLS, you get transparent encryption of the data, which makes the whole thing a lot easier. Why are you storing it in Windows Azure Blob Storage instead?
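If you end up decrypting outside of U-SQL in a small pre-processing step instead, the download side mirrors the upload the question describes: apply the same BlobEncryptionPolicy on download and then decompress. A sketch, assuming the classic storage client and a key resolver wired up the same way as at upload time:
using System.IO;
using System.IO.Compression;
using Microsoft.Azure.KeyVault.Core;        // IKeyResolver
using Microsoft.WindowsAzure.Storage.Blob;

public static class EncryptedGzBlobReader
{
    // Downloads a client-side-encrypted, gzipped blob, decrypts it with the same
    // key infrastructure used at upload time, and returns a decompressing stream.
    public static Stream OpenDecrypted(CloudBlockBlob blob, IKeyResolver keyResolver)
    {
        var options = new BlobRequestOptions
        {
            EncryptionPolicy = new BlobEncryptionPolicy(null, keyResolver)
        };

        var buffer = new MemoryStream();
        blob.DownloadToStream(buffer, accessCondition: null, options: options);
        buffer.Position = 0;

        // After decryption the payload is still gzip-compressed.
        return new GZipStream(buffer, CompressionMode.Decompress);
    }
}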

Creating thumbnails for images on S3

I have quite a common situation, I suppose. I have a website located on Amazon EC2 and I'd like to move all dynamic files to Amazon S3. Everything seems OK, except for 2 points:
I'm using the PDFNet library with their WebViewer. To display PDF files in the browser, WebViewer uses a special ".xod" format, and PDFNet provides functionality to convert PDF files to XOD. Consider the case where a PDF file was uploaded to S3 and no XOD file was created (I'm going to use Lambda to avoid this in the future, but still). In this case, do I have to download the file to my local machine, convert it to a XOD file, and upload the XOD file to S3? I don't see any other way to do it, but it can take a lot of traffic.
The second problem is almost the same, but it concerns thumbnails. Currently I dynamically resize thumbnails depending on the required resolution and I'd like to keep doing that. AWS Lambda is not suitable in this case, so what is the best way to do it?
Why do you say that Lambda is not suitable here?
For pt#1: PDFNet provides a Java library, so you can write a Lambda function in Java (it's possible now) and use that to get infinite scale.
For pt#2: Amazon's tutorial (http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html) gives a detailed example of how to resize images when they are uploaded to S3. The example is in Node.js; you can write a Java version as well if you like.
Note that if you want custom logic for decision making, you can add attributes while uploading the file to S3 (http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#User-Defined Metadata), which you can use in your Lambda function to make decisions while resizing.
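As a sketch of that last point, user-defined metadata can be attached at upload time and read back inside the resize function. This example uses the AWS SDK for .NET; the metadata key and the sizes are made up for illustration:
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class ImageUploader
{
    // Uploads an image with user-defined metadata that a resize Lambda can read
    // later to decide which thumbnail sizes to generate. Key names are made up.
    public static async Task UploadWithHintsAsync(IAmazonS3 s3, string bucket,
                                                  string key, string filePath)
    {
        var put = new PutObjectRequest
        {
            BucketName = bucket,
            Key = key,
            FilePath = filePath
        };
        put.Metadata.Add("thumbnail-sizes", "128,256,512");   // stored as x-amz-meta-thumbnail-sizes

        await s3.PutObjectAsync(put);
    }
}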

Azure Blob Storage: CloudBlockBlob.DownloadToStream throws OutOfMemory exception

This happens on my Azure Storage Emulator (I have not tried real Azure Storage yet). I'm saving files to Blob Storage. I don't have any problem with smaller files (e.g. <= 107 MB). However, for bigger files (e.g. >= 114 MB), I can upload the file without error, but I get an out-of-memory exception when trying to download it.
public Stream GetStream(string fileName)
{
    var blob = GetCloudBlobContainer().GetBlockBlobReference(fileName);
    if (blob.Exists())
    {
        Stream stream = new MemoryStream();
        blob.DownloadToStream(stream);
        return stream;
    }
    return null;
}
The exception is thrown on call blob.DownloadToStream(stream).
How to fix this problem?
UPDATE:
Okay, I found a workaround for my case. Instead of returning a stream, I can make it save to a local file directly (I need the file anyway) using blob.DownloadToFile(), which works fine. However, I'm still interested in finding a solution to this problem.
MemoryStream stores all your data in memory, and the fact that DownloadToFile works for you means that your machine is probably running out of memory when trying to hold the blob in a MemoryStream.
As for uploads: if you upload directly from a file on your file system to a blob, we do not load the whole file into memory, so you will not hit the same problem as on download.
In addition to Vinay's answer above, I would suggest keeping the "Performance" and "Processes" tabs of the Windows Task Manager open to monitor memory usage while downloading.
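Another option, if the caller only needs to read the data, is to hand back the blob's own read stream instead of copying it into a MemoryStream; OpenRead (the same call used in the Azure question further up this page) pulls data on demand. A sketch along those lines, adapted to take the container as a parameter rather than the original GetCloudBlobContainer() helper:
using System.IO;
using Microsoft.WindowsAzure.Storage.Blob;

public static class BlobStreamProvider
{
    // Returns a stream that pulls data from Blob Storage on demand instead of
    // copying the whole blob into a MemoryStream first.
    public static Stream GetStream(CloudBlobContainer container, string fileName)
    {
        CloudBlockBlob blob = container.GetBlockBlobReference(fileName);
        if (!blob.Exists())
        {
            return null;
        }
        return blob.OpenRead();   // the caller disposes the stream when done
    }
}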

How do services like Dropbox implement delta encoding if their files are stored in the cloud?

Dropbox claims that during syncing only the portions of files that change are transmitted back to the main server, which is obviously great functionality, but how do they apply those changes to files stored in the Amazon S3 cloud? For example, let's say a 30-page document on a user's desktop contains changes only to page 4. Dropbox now syncs the blocks representing the changes, but what happens on the backend if the files they store are in the cloud? Does that mean they have to download the 30-page document from S3 to their server, replace the blocks representing page 4, and then upload it back to the cloud? I doubt this is the case because it would be somewhat inefficient. The other option I could think of is if Amazon S3 supported updating a stored file by byte range, so, for example, a PUT request to file X for bytes 100-200 would replace all the bytes from 100 to 200 with the value in the PUT request. So I was curious how companies that use cloud services such as Amazon implement this type of syncing.
Thanks
Since S3 and similar storages don't offer filesystem capabilities, anything that pretends to store files and directories needs to emulate a file system. When doing this, files are often split into pages of a certain size, where each page is stored as a separate object in the storage. This way a changed block requires uploading only one page (for example) and not the whole file. I should note that with files like office documents this approach can be faulty if the file size changes: for example, if you insert a page at the beginning or delete a page, then the whole file changes and the complete file would need to be re-uploaded. We didn't analyze how Dropbox in particular does its job; I just described the common scenario. There also exist different "patch algorithms", where a patch can be created locally (if Dropbox has an older local copy in the cache) and then applied to one or more blocks on the server.
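As a purely illustrative sketch of the block idea (not how Dropbox or S3 actually work), a client could cut a file into fixed-size blocks, identify each by its content hash, and upload only the blocks the server's index does not already contain. The 4 MB block size and the remote hash set below are assumptions:
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class BlockSync
{
    private const int BlockSize = 4 * 1024 * 1024;   // 4 MB blocks, an arbitrary choice

    // Splits a local file into fixed-size blocks and reports which ones are not
    // yet present remotely, identified by content hash. "remoteBlockHashes"
    // stands in for whatever block index the server keeps.
    public static IEnumerable<KeyValuePair<string, byte[]>> ChangedBlocks(
        string path, ISet<string> remoteBlockHashes)
    {
        using (var sha = SHA256.Create())
        using (var file = File.OpenRead(path))
        {
            var buffer = new byte[BlockSize];
            int read;
            while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                var block = new byte[read];
                Array.Copy(buffer, block, read);
                var hash = BitConverter.ToString(sha.ComputeHash(block)).Replace("-", "");

                if (!remoteBlockHashes.Contains(hash))
                    yield return new KeyValuePair<string, byte[]>(hash, block);   // only this block needs uploading
            }
        }
    }
}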
There are several synchronization tools which transfer deltas over the wire, like rsync, rdiff, rdiff-backup, etc. For bi-directional synchronization with S3 there are paid services like s3rsync, for example. For pure client-side synchronization, tools like zsync can be considered (which is what many people use to roll out app updates).
An alternative approach would be to tarball the directory, generate a delta file (using rdiff or xdelta3), and upload the delta file using a timestamp as part of the key. To sync, all you need to do is perform these 2 checks client-side:
1. You have all the delta files from S3. If not, pull them and apply them to generate the latest backup state.
2. Your last backup state corresponds to your current directory. If not, generate a new delta file and push it to S3.
The concerning factor here is the at-least-100% additional space utilization on the client side, but this approach lets you revert changes if needed.