Why can't we just use an ArrayBuffer and convert it to an int array to upload files? - file-upload

I have this silly question, which originates from a college assignment.
Basically, what I was trying to do at the time was upload an image to a Flask backend in a REST way, and the backend would use OpenCV to do image recognition. Because JSON does not support binary data, I followed some online instructions to use base64, which of course works (it seems to be used a lot for file uploading in REST APIs, though I'm not sure of the underlying reason). But later I realized I could actually read the image into an ArrayBuffer, convert it to an int array, and then POST that to the backend. I tried it today and it succeeded. That way the encoding overhead is avoided on both sides, and the payload size is also reduced, since base64 increases size by around 33%.
I want to ask: since we can avoid base64, why do we still use it? Is it just because it avoids issues with line-ending encodings across systems? That seems unrelated to uploading binary data.
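Roughly, the non-base64 route looks like this on the backend (a minimal sketch only; the /upload endpoint name and the returned fields are placeholders, and the client would POST the ArrayBuffer directly, e.g. fetch(url, { method: "POST", body: arrayBuffer })):

import cv2
import numpy as np
from flask import Flask, request

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    raw = request.get_data()                   # raw request body, no base64 decoding step
    arr = np.frombuffer(raw, dtype=np.uint8)   # bytes -> uint8 array
    img = cv2.imdecode(arr, cv2.IMREAD_COLOR)  # decode JPEG/PNG bytes to a BGR image
    if img is None:
        return {"error": "not a decodable image"}, 400
    return {"width": img.shape[1], "height": img.shape[0]}

Posting multipart/form-data (handled in Flask via request.files) is the other common way to send binary without base64.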

PDF Entropy calculation

Last time mkl helped me a lot; hopefully he (or someone else) can help me with these questions too. Unfortunately I couldn't get access to the ISO standard (ISO 32000-1 or 32000-2).
Are these bytes used for padding? I have tried several files, and they all have padding characters. This is quite remarkable, as I would expect this substantial amount of low-entropy bytes to significantly lower the average entropy of the PDF file. However, this does not seem to be the case, as the average entropy of a PDF file is almost eight bits.
Furthermore, this (meta)data should be part of an object stream, and therefore compressed, but this is not the case (is there a specific reason for this?). (Magenta = high entropy/random; the darker the color, the lower the entropy. I generated this image with http://binvis.io/#/.)
These are the entropy values of a .doc file (not .docx) that I converted to a PDF with version 1.4, as this version should not contain object streams etc. However, the entropy values of this file are still quite high. I would think that the entropy of a PDF with a version below 1.5 would be lower on average, as it does not use object streams, but the results are similar to a PDF with version 1.5.
I hope somebody can help me with these questions. Thank you.
Added part:
The trailer dictionary has a variable length, and with PDF 1.5 (or higher) it can be part of the cross-reference stream, so not only the length but also the position/offset of the trailer dictionary can vary (although it seems that even if the trailer dictionary is part of the cross-reference stream, it is always at the end of the file, at least in all the PDFs I tested). The only thing I don't really understand is that for some reason the researchers of this study assumed that the trailer has a fixed size and a fixed position (the last 164 bytes of a file).
They also mention in Figure 8 that a PDF file encrypted by EasyCrypt has some structure in both the header and the trailer (which is why it has a lower entropy value compared to a PDF file encrypted with ransomware).
However, when I encrypt several PDF files (with different versions) with EasyCrypt, trying three different symmetric encryption algorithms (AES 128-bit, AES 256-bit and RC2), I get fully encrypted files, without any structure/metadata left unencrypted (neither in the header nor in the trailer). However, when I encrypt a file with Adobe Acrobat Pro, the structure of the PDF file is preserved. This makes sense, since PDF has its own standardised format for encrypting files, but I don't really understand why they mention that EasyCrypt conforms to this standardised format.
PDF Header encrypted with EasyCrypt:
PDF Header encrypted with Adobe Acrobat Pro:
Unfortunately I couldn't get access to the ISO standard (ISO 32000-1 or 32000-2).
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
Are these bytes used for padding?
Those bytes are part of a metadata stream. The format of the metadata is XMP. According to the XMP spec:
Padding: It is recommended that applications allocate 2 KB to 4 KB of padding to the packet. This allows the XMP to be edited in place, and expanded if necessary, without overwriting existing application data. The padding must be XML-compatible whitespace; the recommended practice is to use the space character (U+0020) in the appropriate encoding, with a newline about every 100 characters.
So yes, these bytes are used for padding.
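As a side note on the entropy puzzle itself, a quick sketch (shannon_entropy is just an illustrative helper, and the per-block view roughly matches how binvis.io colours a file) shows both why the padding is low-entropy and why a few kilobytes of it barely move the average of a large, mostly-compressed file:

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(shannon_entropy(b" " * 4096))             # 0.0 bits/byte for pure space padding
print(shannon_entropy(bytes(range(256)) * 16))  # 8.0 bits/byte for uniformly distributed bytes

# One zero-entropy 4 KB padding block among ~500 otherwise ~8-bit blocks of a 2 MB file:
print(499 * 8.0 / 500)  # ~7.98, so the average block entropy barely drops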
Furthermore, this (meta)data should be part of an object stream, and therefore compressed, but this is not the case (is there a specific reason for this?)
Indeed, there is. The PDF document-wide metadata streams are intended to be readable also by applications that don't know the PDF format but do know the XMP format. Thus, these streams should not be compressed or encrypted.
...
I don't see a question in that item.
Added part
the position/offset of the trailer dictionary can vary (although it seems that even if the trailer dictionary is part of the cross-reference stream, it is always at the end of the file, at least in all the PDFs I tested)
Well, as the stream in question contains cross-reference information for the objects in the PDF, it usually is only finished pretty late in the process of creating the PDF and is, therefore, added pretty late to the PDF file. Thus, a position near the end is usually to be expected.
The only thing I don't really understand is that for some reason the researchers of this study assumed that the trailer has a fixed size and a fixed position (the last 164 bytes of a file).
As already discussed, assuming a fixed position or length of the trailer in general is wrong.
If you wonder why they assumed such a fixed size nonetheless, you should ask them.
If I were to guess why they did, I'd assume that their set of 200 PDFs simply was not generic. In the paper they don't mention how they selected those PDFs, so maybe they used a batch they had at hand without checking how special or how generic it was. If those files were generated by the same PDF creator, chances indeed are that the trailers have a constant (or near-constant) length.
If this assumption is correct, i.e. if they worked only with a non-generic set of test files, then their results, in particular their entropy values, confidence intervals, and the concluded quality of the approach, are questionable.
They also mention in Figure 8 that a PDF file encrypted by EasyCrypt has some structure in both the header and the trailer (which is why it has a lower entropy value compared to a PDF file encrypted with ransomware).
However, when I encrypt several PDF files (with different versions) with EasyCrypt, trying three different symmetric encryption algorithms (AES 128-bit, AES 256-bit and RC2), I get fully encrypted files, without any structure/metadata left unencrypted (neither in the header nor in the trailer).
In the paper they show a hex dump of their file encrypted by EasyCrypt:
Here there is some metadata (albeit not PDF-specific) that should show lower entropy.
As your EasyCrypt encryption results differ, there appear to be different modes of using EasyCrypt, some of which add this header and some don't. Or maybe EasyCrypt used to add such headers but doesn't anymore.
Either way, this again indicates that the research behind the paper is not generic enough, taking just the output of one encryption tool in one mode (or in one version) as a representative example of data encrypted by non-ransomware.
Thus, the results of the article are of very questionable quality.
PDF has its own standardised format for encrypting files, but I don't really understand why they mention that EasyCrypt conforms to this standardised format
If I haven't missed anything, they merely mention that "a constant regularity exists in the header portion of the normally encrypted files"; they don't say that this constant regularity conforms to this standardised format.

What is the size limit for JsonItemExporter in Scrapy?

The following warning is mentioned in the Feed Exports section of the Scrapy docs.
From the docs for JsonItemExporter:
JSON is very simple and flexible serialization format, but it doesn’t scale well for large amounts of data since incremental (aka. stream-mode) parsing is not well supported (if at all) among JSON parsers (on any language), and most of them just parse the entire object in memory. If you want the power and simplicity of JSON with a more stream-friendly format, consider using JsonLinesItemExporter instead, or splitting the output in multiple chunks.
Does this mean that the JsonItemExporter is not suitable for incremental (aka stream-mode) data, or does it also imply a size limit for JSON?
If this means that this exporter is also not suitable for large files, does anyone have a clue about the upper limit for JSON items / file size (e.g. 10 MB or 50 MB)?
JsonItemExporter does not have a size limit. The only limitation is that the output is a single JSON array, which, as the docs warn, most JSON parsers cannot parse incrementally.
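A small sketch of the stream-friendly alternative the docs point to: JsonLinesItemExporter writes one JSON object per line, so the output can be consumed incrementally (the file name items.jl and the item fields are just examples):

import json
from scrapy.exporters import JsonLinesItemExporter

with open("items.jl", "wb") as f:               # exporters write bytes
    exporter = JsonLinesItemExporter(f)
    exporter.start_exporting()
    exporter.export_item({"title": "example", "price": 10})
    exporter.export_item({"title": "another", "price": 20})
    exporter.finish_exporting()

# Reading back never needs the whole file in memory:
with open("items.jl") as f:
    for line in f:
        item = json.loads(line)

On the command line, scrapy crawl myspider -o items.jl will typically select this exporter automatically based on the .jl extension.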

The joys of realtime gif decoding

I have written this gist to convert gif data downloaded from this incredible site and turn it into an mp4 that is playable with MPMoviePlayerController.
The problem is that I must download the entire gif and then start converting. I would like to convert as I receive data, and I believe, but can't confirm, that it might be possible with CGImageSourceCreateIncremental by passing in the data returned by - (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data.
Has anyone attempted this or know of an example?
I know that incremental decoding works for jpegs and pngs. The coding is slightly complex, but you should be able to find sample code here or on Apple's site. There is no way to know what it will do with animated gifs. If it does not do what you need, you can most certainly find an open-source libgif library, which you'd obviously need to port to iOS. I have used libjpeg-turbo in a project to get incremental decoding for JPEGs.
The one problem with the Quartz method you reference is that it always incrementally decodes from the beginning of the image. So, if you have a really big image, as I did, each request for an incremental image takes longer as the size of the download increases. I actually burned a DTS incident on this; the Apple engineer agreed it would be nice if they cached the intermediate image so each request could build on the last one, but they don't do that now.

How to avoid image resizing if the width and height are the same as the original?

Is there a way (a parameter) to prevent ImageResizer from processing an image if the height and width are the same as the original?
If not, where and how do I cancel the scaling process in a plugin?
The ImageResizer does not scale images that are already the appropriate size. It does, however, decode them, strip metadata, and re-encode them into a web-compatible and web-efficient format (usually jpg or png).
If you're wanting the ImageResizer to serve the original file, skipping the entire process, that's a different question, which I'll try to answer below.
Here's the primary challenge with that goal: To discover the width and height of the source image file, you must decode it - at least partially.
This optimization is only useful (or possible) in limited circumstances:
The format of the source file allows you to parse the width and height without loading the entire file into memory. JPG/PNG: yes; TIFF: no; the 30+ formats supported by the FreeImageDecoder: no.
The source file is on local, low-latency disk storage and is IIS-accessible. This eliminates UNC paths and the S3Reader, SqlReader, AzureReader, RemoteReader, MongoReader, etc. plugins.
No URL rewriting rules are in place.
No custom plugins are being used.
The image is already in a web-optimized format with proper compression settings, with metadata removed.
No other URL commands are being used.
No watermarking rules are in place.
You do not need to control cache headers.
You are 100% sure the image is not malicious (without re-encoding, you can't ensure the file can't be both a script and a bitmap).
In addition, unless you cached the result, this 'optimization' wouldn't, in fact, improve response time or server-side performance. Since the dimension data would need to be decoded separately, it would add uniform, significant overhead to all requests whether or not they happened to have a dimension match.
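(For illustration only, and in Python rather than .NET: this is roughly what "parsing the width and height without loading the entire file" amounts to for PNG, where the IHDR chunk sits in the first 24 bytes. png_dimensions is a hypothetical helper; JPEG would instead need a scan for its SOF marker.)

import struct

def png_dimensions(path):
    with open(path, "rb") as f:
        head = f.read(24)  # 8-byte signature + IHDR chunk length/type/width/height
    if head[:8] != b"\x89PNG\r\n\x1a\n" or head[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", head[16:24])
    return width, height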
The only situation in which I see this being useful is if you spend a lot of time optimizing compression in Photoshop and don't want the ImageResizer to touch it unless needed. If you're that concerned, just don't apply the URL in that scenario. Or, set process=no to keep the original bytes as-is.
It's definitely possible to make a plugin to do this; but it's not something that many people would want to use, and I can't envision a usage scenario where it would be a net gain.
If you want to plunge ahead, just handle the Config.Current.Pipeline.PreHandleImage event and replace e.ResizeImageToStream with code that parses the stream returned by e.GetSourceImage(), applies your dimension logic (comparing against Config.Current.GetImageBuilder().GetFinalSize()), and then resets the stream and copies it verbatim if desired, like this:
using (Stream source = e.GetSourceImage())
{
    StreamExtensions.CopyToStream(source, stream); // 4 KiB buffer
}
That might not handle certain scenarios (for example, if the image actually needs to be resized 1px smaller but you're adding a 1px border), but it's close. If you're picky, look at the source code for GetFinalSize and return the image bounds instead of the canvas bounds.

Is it possible to extract tiff files from PDFs without external libraries?

I was able to use Ned Batchelder's python code, which I converted to C++, to extract jpgs from pdf files. I'm wondering if the same technique can be used to extract tiff files and if so, does anyone know the appropriate offsets and markers to find them?
Thanks,
David
PDF files may contain different image data (not surprisingly).
Most common cases are:
Fax data (CCITT Group 3 and 4)
raw raster data with decoding parameters and an optional palette, all compressed with Deflate or LZW compression
JPEG data
Recently, I (as the developer of a PDF library) have started noticing more and more PDFs with JBIG2 image data. Also, JPEG2000 data can sometimes be put into a PDF.
I should say that you probably can extract JPEG/JBIG2/JPEG2000 data into corresponding *.jpeg / *.jp2 / *.jpx files without external libraries, but be prepared for all kinds of weird PDFs emitted by broken generators. Also, PDFs quite often use object streams, so you'll need to implement a sophisticated parser for PDF.
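As a hedged sketch of what "without external libraries" looks like in the simplest case only (no object streams, no encryption, JPEG/DCTDecode images stored verbatim), one can scan the raw file for stream ... endstream segments that begin with the JPEG SOI marker:

import sys

def extract_jpeg_streams(pdf_path):
    data = open(pdf_path, "rb").read()
    count, pos = 0, 0
    while True:
        start = data.find(b"stream", pos)
        if start == -1:
            break
        payload = start + len(b"stream")
        # the stream keyword is followed by CRLF or LF before the stream data
        if data[payload:payload + 2] == b"\r\n":
            payload += 2
        elif data[payload:payload + 1] == b"\n":
            payload += 1
        end = data.find(b"endstream", payload)
        if end == -1:
            break
        body = data[payload:end]
        if body.startswith(b"\xff\xd8"):  # JPEG SOI marker -> likely a DCTDecode stream
            with open(f"image_{count}.jpg", "wb") as out:
                out.write(body)
            count += 1
        pos = end + len(b"endstream")
    return count

if __name__ == "__main__":
    print(extract_jpeg_streams(sys.argv[1]), "JPEG streams written")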
Fax data (i.e. what you probably call TIFF) should at least be packed into a valid TIFF. You can borrow some code for that from the open-source libtiff, for example.
And then comes raw raster data. I don't think it makes sense to try to extract such data without the help of a library. You could do it, of course, but it would take months of work.
So, if you are trying to extract only a specific kind of image data from a set of PDFs all created with the same generator, then your task is probably feasible. In all other cases I would recommend saving time, money and hair and using a library for the task.
PDF files store JPEGs as actual JPEGs (DCT and JPX encoding), so in most cases you can rip the data out. With TIFFs, you are looking for CCITT data (but you will need to add a header to the data to make it a TIFF). I wrote two blog articles on images in PDF files at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/ and http://www.jpedal.org/PDFblog/2011/07/extract-raw-jpeg-images-from-a-pdf-file/ which might help.
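To make the "add a header" step concrete, here is a rough sketch (illustrative only: the width and height must be read from the image XObject's /Width and /Height entries, and the /DecodeParms entries /K, /Columns and /BlackIs1 determine whether plain Group 4 with these defaults is even the right interpretation) that wraps raw CCITT Group 4 data in a minimal single-strip little-endian TIFF:

import struct

def wrap_ccitt_g4_as_tiff(ccitt_data, width, height, out_path):
    entries = [
        # (tag, type, count, value); type 3 = SHORT, type 4 = LONG
        (256, 4, 1, width),            # ImageWidth
        (257, 4, 1, height),           # ImageLength
        (258, 3, 1, 1),                # BitsPerSample = 1
        (259, 3, 1, 4),                # Compression = CCITT Group 4 (T.6)
        (262, 3, 1, 0),                # PhotometricInterpretation = WhiteIsZero
        (273, 4, 1, 0),                # StripOffsets (patched below)
        (277, 3, 1, 1),                # SamplesPerPixel = 1
        (278, 4, 1, height),           # RowsPerStrip = whole image in one strip
        (279, 4, 1, len(ccitt_data)),  # StripByteCounts
    ]
    ifd_size = 2 + len(entries) * 12 + 4
    data_offset = 8 + ifd_size                 # header + IFD, then the strip data
    entries[5] = (273, 4, 1, data_offset)      # point StripOffsets at the data

    out = bytearray()
    out += struct.pack("<2sHI", b"II", 42, 8)  # little-endian TIFF header, first IFD at offset 8
    out += struct.pack("<H", len(entries))
    for tag, typ, count, value in entries:
        out += struct.pack("<HHII", tag, typ, count, value)
    out += struct.pack("<I", 0)                # no further IFDs
    out += ccitt_data
    with open(out_path, "wb") as f:
        f.write(bytes(out))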