Convert binary streams using the Ghostscript API

How do I convert a PDF that exists as a binary blob in memory into a JPEG, also in memory, using the pure Ghostscript API interface (for example, gsdll32.dll)?
The documented interface assumes the default files on disk:
args = [
"-dFirstPage=10",
"-dLastPage=10",
"-sDEVICE=jpeg",
"-r300",
"-sOutputFile=book.jpg",
"-dNOPAUSE",
"test2.pdf"
]

Basically, you can't. The Ghostscript PDF interpreter assumes that it's dealing with a file on disk (see, for example, the definitions of runpdf and runpdfbegin in pdf_main.ps). Possibly you could convert that into a stream and pass it in, but it looks like a lot of work to me, all in PostScript.
You definitely can't have the JPEG output written to the same memory location.
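For what it's worth, the disk-based route through the DLL's documented gsapi_* entry points can be sketched from Python with ctypes. The DLL name and file arguments are the ones from the question; the helper below is a sketch under those assumptions, not production code:

```python
import ctypes

def build_gs_argv(args):
    # gsapi_init_with_args expects a C-style argv whose first element
    # is a dummy program name, just like a normal main() argv.
    encoded = [b"gs"] + [a.encode("ascii") for a in args]
    return (ctypes.c_char_p * len(encoded))(*encoded)

def render_pdf_page(dll_path, args):
    # Sketch: drive the documented gsapi_* entry points.
    # Note that input and output are still files on disk.
    gs = ctypes.CDLL(dll_path)
    instance = ctypes.c_void_p()
    if gs.gsapi_new_instance(ctypes.byref(instance), None) < 0:
        raise RuntimeError("could not create Ghostscript instance")
    argv = build_gs_argv(args)
    try:
        code = gs.gsapi_init_with_args(instance, len(argv), argv)
    finally:
        gs.gsapi_exit(instance)
        gs.gsapi_delete_instance(instance)
    if code < 0:
        raise RuntimeError(f"Ghostscript failed with code {code}")
```

With the args list from the question, this would be invoked as render_pdf_page("gsdll32.dll", args), and book.jpg would then have to be read back from disk.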

Related

Is it possible to obfuscate PDF file binary data?

Is it possible to obfuscate the bytes that are visible when a PDF file is opened with a hex editor? Also, I wonder whether the contents of the PDF file can still be viewed normally even if it is obfuscated.
You will always be able to see whatever bytes are within a file using a hex editor.
There might be ways to generate your pdf pages using methods that don't involve directly writing the text into the pdf (for example using javascript that's obfuscated).
As answered above, the bytes of the file are always visible when viewed with a hex editor. However, there are some options to hide/protect data in the file:
You could encrypt either the whole PDF or partial datasets. Note that encryption/decryption always requires a secret; when the file is fully encrypted, you can't read it without the key.
You can add additional similar dataframes but set them invisible in the PDF. Note that this technique blows up the size of the file.
You can use scripting languages which dynamically build up your PDF. Be aware that this could look suspicious to users or to anti-virus software.
You can use steganography tools to hide your data; one example is steghide.
You can simply compress datastreams in the PDF, e.g. using gzip or similar compression tools. That way the data can't be read directly; however, it is easy for anyone to recognize and decompress.
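The last option is easy to demonstrate with Python's zlib, which implements the same Flate compression that PDF uses for its streams (the sample data here is made up):

```python
import zlib

text = b"Confidential PDF stream content. " * 4
compressed = zlib.compress(text)

# The compressed bytes look like gibberish in a hex editor,
# and the plaintext no longer appears verbatim in the file...
assert len(compressed) < len(text)
assert text not in compressed

# ...but anyone can reverse the compression without any secret.
assert zlib.decompress(compressed) == text
```

This is why compression only hides data from casual inspection, unlike real encryption.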

What kind of encoding is this gibberish

This file I downloaded is supposed to be a PDF (I think, could be just a text file for all I know) but see the picture below for what the file looks like. Does anyone know what this is or if it can be converted?
If it's from a PDF file, it is likely to be Flate encoded (the same type of compression as is used with zip files, but no, you cannot open a PDF file with a zip utility). This is the most common compression in a PDF for non-image data. It's not ASCIIHex or ASCII85 encoded. It could be, but likely isn't, LZW or RunLength (RLE) encoded. If it is image data, it could be CCITTFax, JBIG2, DCT (essentially JPEG), or JPX (JPEG 2000) encoded.
In some cases, parts of a PDF might be encoded by more than one of these filters, so a combination of, say, DCT and ASCII85 could be used, but this isn't as common anymore.
Or the PDF file could be encrypted, in which case you have a choice of RC4 or different flavors of AES encryption. It's also possible that custom encryption was used (e.g. if the PDF file is an E-Book).
The screenshot you provided doesn't contain enough information to determine which of these applies to that particular part of the file, but the bottom line is that you need to read your PDF file with software that understands the PDF format; a text editor won't do.
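That said, if you can extract the raw bytes of a single stream, a rough first-pass guess at some of these filters can be scripted. This is only a heuristic sketch (a real check reads the stream's /Filter entry with a proper PDF parser):

```python
import string
import zlib

def guess_stream_filter(data: bytes) -> str:
    """Very rough guess at a PDF stream's encoding; a real answer
    requires reading the stream's /Filter entry with a PDF parser."""
    # FlateDecode streams carry a zlib header and must inflate cleanly.
    try:
        zlib.decompress(data)
        return "FlateDecode?"
    except zlib.error:
        pass
    body = data.decode("latin-1")
    # ASCIIHexDecode: hex digits plus whitespace, terminated by ">".
    if all(c in string.hexdigits + string.whitespace + ">" for c in body):
        return "ASCIIHexDecode?"
    # ASCII85Decode: printable range "!".."u", plus "z" and the "~>" EOD.
    if all(33 <= b <= 117 or chr(b) in "z~\r\n \t" for b in data):
        return "ASCII85Decode?"
    return "unknown (maybe DCT/JPX image data, LZW, or encrypted)"
```

Anything that fails all three checks is most likely binary image data or encrypted content, which matches the advice above: use PDF-aware software.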

Conversion from PDF to TIFF file using XSLT

Is it possible to convert a PDF to a TIFF file using XSLT? Can someone point out some article or code I can refer to regarding image conversion using XSLT?
THANKS!
No, it is not possible using just XSLT. XSLT is for transforming XML into other textual structures (usually XML, HTML, or plain text). Using XSL-FO, you can output a PDF from XML data, but that is a one-way process as far as XSL-FO is concerned. Apache FOP does support outputting to TIFF instead of PDF, but again this is a one-way process.
Assuming you could get a PDF -> XML conversion working (a quick google suggests such libraries exist, but it's unclear what they'd actually provide), it would be possible to use XSLT to transform that XML into something Apache FOP could render into a TIFF file, but at that point you'd really be better off investigating a direct PDF to TIFF conversion library (perhaps with an OCR library).
Possible? Maybe (but likely not). The real question is why do you even want to try to create a TIFF file from a PDF file using XSLT?
You do not need XSLT.
You want a raster image processor like Ghostscript (or many others). It can convert PDF (and PostScript) to other image formats like TIFF.
http://ghostscript.com/doc/current/Devices.htm
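As a sketch (assuming a gs binary on the PATH, and with made-up file names), the command line for such a conversion could be assembled like this; tiff24nc is one of the TIFF devices listed on that page:

```python
import subprocess

def pdf_to_tiff_cmd(src, dst, dpi=300):
    # tiff24nc = 24-bit colour TIFF; tiffgray and tiffg4 (fax)
    # are alternatives listed on the Ghostscript Devices page.
    return ["gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
            "-sDEVICE=tiff24nc", f"-r{dpi}",
            f"-sOutputFile={dst}", src]

def pdf_to_tiff(src, dst, dpi=300):
    # Requires a Ghostscript executable ("gs") on the PATH.
    subprocess.run(pdf_to_tiff_cmd(src, dst, dpi), check=True)
```

For example, pdf_to_tiff("input.pdf", "output.tif") would rasterize every page into one multi-page TIFF.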
The only way to do that is to call a conversion service, e.g. aspose.com or to create another service externally to the DataPower box.
There might be some Node.js modules that could do it running in GatewayScript (GWS) (if you are on firmware 7+) but I believe they are all dependent on external binaries to function and that won't work in GWS.

Why does the combination pdf2ps / ps2pdf shrink the PDF?

When researching how to compress a bunch of PDFs with pictures inside (ideally in a lossless fashion, but I'll settle for lossy) I found that a lot of people recommend doing this:
$ pdf2ps file.pdf
$ ps2pdf file.ps
This works! The resulting file is smaller and looks at least good enough.
How / why does this work?
Which settings can I tweak in this process?
If there is some lossy conversion, which one is that?
Where is the catch?
People who recommend this procedure rarely do so from a background of expertise or knowledge -- it's rather based on gut feelings.
The detour of generating a new PDF via PostScript and back (also called "refrying" a PDF) is never going to give you optimal results. Sometimes it is useful, e.g. in cases where the original PDF won't print at all, or cannot be processed by another application. But these cases are very rare.
In any case, this "roundtrip" conversion will never lead to the same PDF file as initially.
Also, the pdf2ps and ps2pdf tools aren't independent tools at all: they are just simple wrapper scripts around a Ghostscript (gs or gswin32c.exe) command line. You can check that yourself by doing:
cat $(which ps2pdf)
cat $(which pdf2ps)
This will also reveal the (default) parameters these simple wrappers use for the respective conversions.
If you are unlucky, you will have an ancient Ghostscript installed. The PostScript then generated by pdf2ps will be Level 1 PS, and this will be "lossy" for many fonts which could be used by more modern PDF files, resulting in rasterization of fonts that were previously vectors. Not exactly the output you'd like to look at...
Since both tools use Ghostscript anyway (but behind your back), you are better off running Ghostscript yourself. This gives you more control over the parameters it uses. Especially advantageous is the fact that this way you can get a direct PDF->PDF conversion, without any detour via an intermediate PostScript file.
Here are a few answers which would give you some hints about what parameters you could use in order to drive the file size down in a semi-controlled way in your output PDF:
Optimize PDF files (with Ghostscript or other) (StackOverflow)
Remove / Delete all images from a PDF using Ghostscript or ImageMagick (StackOverflow)
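As a sketch of that direct PDF->PDF route (the preset and file names below are placeholders to tune, not the one true answer), the pdfwrite device with one of its -dPDFSETTINGS presets is the usual starting point:

```python
import subprocess

def shrink_pdf_cmd(src, dst, preset="/ebook"):
    # -dPDFSETTINGS presets trade file size against quality:
    # /screen (smallest), /ebook, /printer, /prepress (largest).
    return ["gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
            f"-dPDFSETTINGS={preset}", "-dNOPAUSE", "-dBATCH",
            f"-sOutputFile={dst}", src]

# Requires a Ghostscript executable ("gs") on the PATH:
# subprocess.run(shrink_pdf_cmd("file.pdf", "smaller.pdf"), check=True)
```

The presets mostly control image downsampling and recompression, which is where the lossy part of the roundtrip comes from.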

Why are ePub files so much smaller than mobi or PDF files for the same book

When I buy ebooks I download all of the available formats. I've noticed that the file sizes for the various formats can be markedly different and epub is typically much smaller.
For example:
PDF - 5.7mb;
ePub - 2.7mb;
Mobi - 8.1mb.
Or:
PDF - 4.5mb;
ePub - 1.8mb;
Mobi - 5.3mb.
I've flipped through them and tried to confirm that the contents are the same and they seem to be (i.e. no large images missing). Can anyone explain why epub is so much smaller than the other two?
The mobi versions can be larger because they include the legacy mobi format, the new KF8 format, and a copy of the original epub, assuming the mobi file was generated with the latest version of kindlegen.
For the PDFs, I'm guessing (and that's all it is here) that embedded fonts may be the cause of the larger file size. Another thing that comes into play here is image optimisation: the settings used when the PDF was created will largely affect the final file size.
Epubs are basically just a bunch of HTML, CSS, and image files, with a few XML files defining the book's metadata, chapter order, and table-of-contents navigation. The epub file is really just a zip file with a .epub extension, and since it doesn't have three copies of the same book like the Kindle version does, it will always be much smaller.
Because an epub is similar to a website. An epub book is made from XHTML and CSS2 (plus some CSS3 features); the software that reads the epub interprets those files and renders a visual representation from that code.
.epub files are compressed (in fact, they are just zip files).
.mobi files are not compressed. If you zip a mobi file, you may get a smaller file than the epub.
Incidentally, this makes text searching much faster on mobi files than on epub.
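The "an epub is just a zip file" point is easy to verify with Python's standard zipfile module; the file layout below is a minimal made-up example, not a complete valid epub:

```python
import io
import zipfile

def list_epub_contents(epub_bytes):
    # An .epub is an ordinary zip archive; per the EPUB spec its
    # first entry is an uncompressed file named "mimetype".
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.namelist()

# Build a toy "epub" in memory to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("mimetype", "application/epub+zip")
    zf.writestr("OEBPS/chapter1.xhtml", "<html><body>Hi</body></html>")

print(list_epub_contents(buf.getvalue()))
```

Renaming a real .epub to .zip and opening it with any archive tool shows the same structure, which is also why its text is stored compressed.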
That depends on the format of the mobi that you have. As you may already be aware, an epub file can be converted into any ebook format that you choose; you can consider the epub format the base for any other format.
I am guessing that the mobi file you have has the original epub embedded inside it. This is to assist editing tools (as direct editing of mobi files is cumbersome). Also, some mobi files contain several versions of the mobi (mobi-7 and KF8) to maintain backward compatibility with readers that do not support the latest format.
You can find more information about the file formats here