progressive pdf download - objective-c

I want to download a PDF file progressively in an iPad application. I'm not sure how to do that, and Google wasn't very helpful. Can anyone help me understand the concepts here, please? I am planning to render it with Core Graphics.
Thanks.

Do you mean you want to render PDF pages before the download is completed? If yes:
First of all, the PDF format was not originally designed for that.
Let me explain. A PDF file consists of a number of objects and an xref table. The xref is a table containing the location (in bytes from the beginning) of every object within the file, so objects may sit at arbitrary positions within the file. Even worse, the xref itself is located at the end of the file, so you can't locate any object until you have downloaded the whole file.
So PDF is designed for random access. The HTTP protocol does allow random access (via range requests), so if you really need it, you can try to implement it :)
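As a language-agnostic illustration of that random-access idea (sketched in Java here rather than Objective-C, with a placeholder URL and byte count), an HTTP range request can fetch just the last few bytes of the file, which is where the trailer and the startxref pointer live:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FetchPdfTail {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; the server must support range requests
            // (it should answer with "206 Partial Content").
            URL url = new URL("http://example.com/document.pdf");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            // Suffix range: ask only for the last 1024 bytes of the file,
            // which is where the trailer and the startxref pointer live.
            conn.setRequestProperty("Range", "bytes=-1024");

            System.out.println("HTTP status: " + conn.getResponseCode());
            try (InputStream in = conn.getInputStream()) {
                byte[] tail = in.readAllBytes();
                System.out.println("Fetched " + tail.length + " bytes from the end of the PDF");
            }
        }
    }

From those bytes a viewer could parse the startxref offset and then issue further range requests for just the objects it needs.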
Good news for you: starting with PDF 1.2 there is a special feature called "Linearized PDF". It is designed exactly for your task, so you can render the first page before the rest of the file has downloaded. You can google around or check the PDF Reference for more details. The most important thing: the PDF file has to be linearized using special tools, so not every PDF file can be rendered progressively.
Bad news for you: it looks like Core Graphics doesn't support this. I haven't actually tried it, but I found nothing about linearized PDF in the Core Graphics documentation. (Please let me know if you find anything.) So you may need to render the PDF manually.

Not entirely sure about the iPad, but doing a Save As... in Acrobat will by default optimize the file for Fast Web View, which allows it to be downloaded a page at a time instead of the whole document in one go.
http://www.adobe.com/designcenter-archive/acrobat/articles/acr6optimize/acr6optimize.pdf

Linearized PDF will meet your needs. You need a capable reader such as the one from Adobe to utilize this feature.

Related

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet, and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes a selected area of a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Is there something that can accomplish this, or anything that might be reused (like an open source reader) to get there a little faster, without having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
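For example, here is a rough sketch of driving pdftops from Java for a single page. It assumes pdftops is on your PATH and accepts the usual -f/-l first-page/last-page options; the file names and page number are placeholders:

    import java.io.IOException;

    public class ConvertSinglePage {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Convert only page 3 of input.pdf to PostScript (page3.ps).
            Process p = new ProcessBuilder(
                    "pdftops", "-f", "3", "-l", "3", "input.pdf", "page3.ps")
                    .inheritIO()   // pass pdftops output/errors through to the console
                    .start();
            int exitCode = p.waitFor();
            System.out.println("pdftops exited with code " + exitCode);
        }
    }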
The Poppler library was originally written for UNIX-like systems, but there are a couple of Windows builds available.
The open source tool from iText called iText RUPS does what you want, showing you all the PDF commands for a particular PDF and allowing you to visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/

Providing an embedded webkit with resources from memory

I'm working on an application that embeds WebKit (via the Gtk bindings). I'm trying to add support for viewing CHM documents (Microsoft's bundled HTML format).
HTML files in such documents have links to images, CSS, etc. of the form "/blah.gif" or "/layout.css", and I need to catch these to provide the actual data. I understand how to hook into the "resource-request-starting" signal, and one option would be to unpack parts of the document to temporary files and change the URI at this point to point at those files.
What I'd like to do, however, is provide WebKit with the relevant chunk of memory. As far as I can see, you can't do this by catching resource-request-starting, but maybe there's another way to hook in?
An alternative is to base64-encode the image into a data: URI. It's not exactly better than using a temporary file, but it may be simpler to code.
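A minimal sketch of that idea in Java (the bytes here stand in for whatever chunk of the CHM file you already have in memory; the MIME type and contents are placeholders):

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class DataUriExample {

        // Build a data: URI for an in-memory resource so it can be handed to
        // the embedded WebKit view instead of being written to a temporary file.
        static String toDataUri(byte[] bytes, String mimeType) {
            return "data:" + mimeType + ";base64,"
                    + Base64.getEncoder().encodeToString(bytes);
        }

        public static void main(String[] args) {
            byte[] fakeGif = "not really a gif".getBytes(StandardCharsets.UTF_8);
            System.out.println(toDataUri(fakeGif, "image/gif"));
        }
    }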

Is there a reliable way to determine if a PDF was generated from a Powerpoint file?

Like the title says. Reason I ask is that we're converting PDFs to formatted ASCII text (using pdftotext) and only want to display the ones that look reasonably sane.
PPT files tend to have text over images, diagonal text and others things that don't translate to ASCII very well, so we'd like to filter them out if we can.
The creating application of a PDF is listed in its XMP metadata. You can see this quite easily in Acrobat 9 (and I believe earlier): go to File > Properties, click Additional Metadata..., then go to Advanced and it's listed under both XMP Core Properties and PDF Properties:
xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint
I'm guessing you want to find this programmatically, so you'll need to find a library that reads this metadata and works with your language. Here is a list of some XMP tools.
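For instance, if you happen to use a PDF library such as iText, the document information dictionary (as opposed to the XMP packet) already carries Creator/Producer entries you can check. A rough sketch, with iText 5 class names and the file name as assumptions:

    import java.util.Map;

    import com.itextpdf.text.pdf.PdfReader;

    public class DetectPowerPointSource {
        public static void main(String[] args) throws Exception {
            PdfReader reader = new PdfReader("input.pdf");   // placeholder file name
            Map<String, String> info = reader.getInfo();     // document information dictionary

            String creator = info.get("Creator");
            String producer = info.get("Producer");
            System.out.println("Creator:  " + creator);
            System.out.println("Producer: " + producer);

            // Crude heuristic: flag documents whose creating application mentions
            // PowerPoint. Converters are free to omit or overwrite these entries,
            // so treat a miss as "unknown" rather than "definitely not PowerPoint".
            boolean looksLikePpt =
                    (creator != null && creator.contains("PowerPoint"))
                    || (producer != null && producer.contains("PowerPoint"));
            System.out.println("Looks like PowerPoint: " + looksLikePpt);

            reader.close();
        }
    }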
Short answer:
No, I don't think so.
Long answer:
No, I don't think so, because there are many ways to convert a PowerPoint file to PDF, for example Adobe Acrobat, PDFCreator, and many others. It's up to the converter to embed specific information in the PDF file, so even if you find a way to detect PowerPoint-sourced PDFs from one converter, the same method may not work for another.
Even longer answer:
No, I don't think so, because of the reasons described in the "long answer". And I don't think detecting the source of the PDF is the best approach to the problem you are trying to solve. PowerPoint is not the only source of overlapping text and images. I think it's much better to examine the actual layout of the PDF file. If images and text overlap, then do some filtering or pre-processing to cater for that.
Your reasoning is very arbitrary - there are surely plenty of PPT files without the features you describe, and plenty of PDF files with them, that were generated from another source.
In theory a better method would just be to detect when these "unwanted" situations occur. However, even though the PDF format is partly open (only for reading, apparently, so it's not truly an open format), extracting complex data like that would be incredibly difficult.
All PDFs can have this problem regardless of their source. Most desktop publishing suites are capable of outputting PDF and are often sold boasting their high quality and flashier PDF presentations ...
A "saner" method would be to use a PDF parser, ITextSharp, or pdfNet...etc, Using the library of your choice, find all image rectangles, and all text rectangles, SORT THE RECTANGLES, and then see if there is substantial overlap of text and image rects -- ignoring image to image overlaps. If so, reject the page and/or document.
That won't be perfect, but at least it's going to catch many PDFs that aren't sane, regardless of source. Other heuristics to add would include color analysis. (i.e. are the colors in the overlapping region sufficiently different to allow "sane" results?)
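A library-agnostic sketch of that heuristic in Java: assume you have already pulled the text and image rectangles for one page out of whatever PDF toolkit you use, and simply measure how much of the text area sits on top of images (the 0.2 threshold and the rectangles below are placeholders):

    import java.awt.geom.Rectangle2D;
    import java.util.Arrays;
    import java.util.List;

    public class OverlapHeuristic {

        // Returns true if a "substantial" fraction of the page's text area
        // overlaps image areas. All rectangles are assumed to be in the same
        // coordinate space (e.g. PDF user space for a single page).
        static boolean pageLooksInsane(List<? extends Rectangle2D> textRects,
                                       List<? extends Rectangle2D> imageRects,
                                       double threshold) {
            double textArea = 0;
            double overlapArea = 0;
            for (Rectangle2D text : textRects) {
                textArea += text.getWidth() * text.getHeight();
                for (Rectangle2D image : imageRects) {
                    Rectangle2D overlap = text.createIntersection(image);
                    if (!overlap.isEmpty()) {
                        overlapArea += overlap.getWidth() * overlap.getHeight();
                    }
                }
            }
            return textArea > 0 && (overlapArea / textArea) > threshold;
        }

        public static void main(String[] args) {
            // Placeholder rectangles: one line of text half-covered by an image.
            List<Rectangle2D.Double> text =
                    Arrays.asList(new Rectangle2D.Double(0, 0, 100, 20));
            List<Rectangle2D.Double> images =
                    Arrays.asList(new Rectangle2D.Double(50, 0, 100, 100));
            System.out.println(pageLooksInsane(text, images, 0.2)); // prints "true"
        }
    }

Sorting the rectangles, as suggested above, lets you prune the inner loop when a page contains many elements.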
Best of luck to you
It might put its name in the creator or producer info, but I don't have a copy to check this theory with.
In general, it is not an easy task to programmatically determine (reliably) where a file came from or how it was generated based on its contents. After all, a file is just a collection of bits.
Unless you have a lot of resources to expend building the heuristics to determine whether a file looks "reasonably sane" according to your needs, I would consider this a task for human beings.
Some converters from PPT to PDF preserve the creator in comments at the beginning of the PDF.
I think that PDFs generated from most applications look much the same. They may have some metadata that you can read from the file...

Web served image in PDF?

Does PDF and/or Adobe Reader support including an image by URL, so that you can insert a dynamic image from a web server into a document?
The answer to your question is both yes and no. If you look in the PDF spec (I'm going by version 1.7) in section 7.11.5, you'll see that a stream within a PDF document can be represented by a URL. So yes, you can go ahead and specify that a PDF has, say, its image content at the specified URL.
The problem is that when you specify an image within a PDF, you are specifying a PARTICULAR image that must have a particular data length and encoding. Simply specifying dimensions, DCT compression (aka JPEG), and a URL is not enough. Images are contained in streams of a particular length. If the stream is too long or too short, it is considered an error.
So you can have images dynamically served up, provided that they are always exactly the same byte length. I think. And I say this because the specification is somewhat ambiguous as to what happens when you set the length to 0 in the stream dictionary.
Now, is doing this practical? Maybe - you'll need a fairly strong PDF toolkit in order to be able to author these documents. And if you have that, I think you'd be better off authoring the entire PDF document that your clients want on the fly rather than trying to substitute an image at read time.
I don't believe you can place a dynamic image in a PDF document in this manner. It's possible to dynamically create an entire PDF document using web-hosted content (using PHP, Coldfusion, etc.) but changing that content later on the web server will not dynamically update previously generated PDF documents, which is what it sounds like you want to do.
As PDFs are meant to be portable by nature (PORTABLE Document Format) and thus, not always viewed online, this goes against the very principle of the document format, and is not supported as far as I know.
You could include a reference to an image at the time of generation of the PDF, but said image will be embedded into the PDF, not linked.
You could use pdf.js and modify the rendering methods slightly so that you inject your image. You can find pdf.js here: https://github.com/mozilla/pdf.js
You can also use FlexPaper which has an API that allows you to overlay your document with images
http://flexpaper.devaldi.com/

Create destinations for all bookmarks in a PDF file with iText API

I'd like to write some (java) code that takes a PDF document, and creates named destinations from all of the bookmarks. I think the iText API is the easiest way of doing this, but I have never used the API before.
How would you go about writing this sort of code with the iText API? Can iText do the parsing needed to manipulate existing PDFs by itself? The kind of manipulations I am thinking of are:
Open,
Find bookmarks,
Create destinations,
Save,
Close.
Or is there a different API that would be better?
Followup: I submitted a patch to iText a few months ago (it has now been accepted and is part of HEAD) that adds text parsing capabilities to iText. PdfBox (mentioned below) has (had?) problems with reading newer PDFs that use xref streams instead of the older xref table format.
Another library that is very good at parsing existing PDF files is PDFBox. It can also be used for modifying an existing PDF. FYI - this is the text parser that Lucene uses.
I will also mention that iText does have the ability to parse a PDF file, it's just not great at parsing the text content on each page. If you are looking at accessing the PDF's higher-level constructs (dictionaries, etc.) that are used for storing bookmarks and the like, and you don't mind getting your hands a little dirty with reading the PDF spec, you can absolutely do what you are asking about (we do it quite a bit ourselves).
The PDF Spec is big, but readable for the most part, and you don't have to worry about the bulk of it (which is geared towards actual page content and rendering) if all you are trying to do is extract bookmarks.
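As a small, hedged illustration of what getting your hands dirty with the dictionaries can look like, here is a sketch that walks only the top level of the outline tree, using iText 5 package names (the file name is a placeholder; adjust for your iText version):

    import com.itextpdf.text.pdf.PdfDictionary;
    import com.itextpdf.text.pdf.PdfName;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfString;

    public class DumpTopLevelBookmarks {
        public static void main(String[] args) throws Exception {
            PdfReader reader = new PdfReader("input.pdf");            // placeholder
            PdfDictionary catalog = reader.getCatalog();              // the document catalog (/Root)
            PdfDictionary outlines = catalog.getAsDict(PdfName.OUTLINES);
            if (outlines == null) {
                System.out.println("No outline (bookmark) tree in this document.");
                reader.close();
                return;
            }
            // Walk the /First -> /Next chain of top-level outline items only;
            // a real implementation would recurse into each item's /First child.
            for (PdfDictionary item = outlines.getAsDict(PdfName.FIRST);
                 item != null;
                 item = item.getAsDict(PdfName.NEXT)) {
                PdfString title = item.getAsString(PdfName.TITLE);
                System.out.println(title == null ? "(untitled)" : title.toUnicodeString());
            }
            reader.close();
        }
    }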
I'll just warn you up front that you may be disappointed with this. iText isn't really intended to be used as a parser. It's really more for creating entirely new PDF documents, but you can take a whack at it.
To start, using iText, you won't be able to modify the existing PDF document. What you can do, though, is to make a copy with the additional features that you want. (If somebody else knows better, please let me know, this drives me crazy.)
What you will want to do is create a PdfReader object from an input stream on your source file. Then create a PdfCopy object (which is just an extended PdfWriter that makes getting data from an existing source more convenient) for your destination.
As far as I can tell, the bookmarks cannot be obtained from iText at all. Another library may be needed. I think JPedal may have the ability to extract them (it can get them as an XML document, which you may then have to parse to get what you want). However you get them, you can then add them to a java.util.List and set that list as the outline on the PdfCopy. The bookmarks themselves are just HashMaps with a particular set of keys. I'm not sure what all of the values are, but they include "Title", "Action" (which seems to be where you'd specify that this is a named destination, though I don't know what that value would be), and "URI" (which is used if this is an external link -- I suspect that this would specify the name of the named destination that you're linking to). Again, the specifics are hard to find.
Then iterate over the pages of the reader, importing each page into the PdfCopy. This page may help you.
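To tie those steps together, here is a rough sketch using iText 5 package names (older builds use com.lowagie.text instead of com.itextpdf.text). It assumes your iText version ships the SimpleBookmark helper, which can return the outline tree as the list of HashMaps described above; the file names are placeholders:

    import java.io.FileOutputStream;
    import java.util.HashMap;
    import java.util.List;

    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.PdfCopy;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.SimpleBookmark;

    public class CopyWithOutlines {
        public static void main(String[] args) throws Exception {
            PdfReader reader = new PdfReader("source.pdf");                        // placeholder
            Document document = new Document();
            PdfCopy copy = new PdfCopy(document, new FileOutputStream("copy.pdf")); // placeholder
            document.open();

            // Import every page from the source into the copy.
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                copy.addPage(copy.getImportedPage(reader, i));
            }

            // SimpleBookmark returns the outline tree as a List of HashMaps
            // (with keys such as "Title" and "Action"); the same list can be
            // handed back to the PdfCopy as the outline of the new document.
            List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
            if (bookmarks != null) {
                copy.setOutlines(bookmarks);
            }

            document.close();
            reader.close();
        }
    }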
Sorry I'm not more helpful to you. Good luck.
P.S. If anybody else knows of a better tool that's either (L)GPL or BSD licensed, I'd love to hear about it.