Where does Preview store PDF annotations on OS X Lion? - pdf

I'm working on a tool in Python to extract highlighted passages from PDF files. I regularly highlight PDFs in Preview on OS X Lion but haven't found a good tool to extract these passages. Other apps exist that do allow you to highlight and export such as Skim but I figure there has to be a way to extract the ones I add in Preview.
I figured that the highlights would be stored in the HFS+ extended attributes for the PDF file but after looking at them using xattr it seems that they're stored elsewhere. I also looked at PDFKit but I only saw how to create annotations rather than locate them.
If someone could tell me where to find the highlights/annotations or point me at some documentation that explains this I would really appreciate it.

When using PDFKit you can get annotation from any PDFPage instance.
[myPDFPage annotations] will return an array of annotations for that particular page.
See the docs for more info.

Technically speaking, highlighting parts of a PDF is adding an annotation to the file. These annotations are PDF objects defined in the PDF specification. They are stored inside the PDF file itself, i.e. they do modify the original file! That's why you'll not find a trace of the highlights in the HFS+ extended attributes...
So the answer to the question of your title line is: Preview stores the highlights inside the PDF file as fully compliant PDF objects.
The answer to your real question implied in your text ('I want to extract the highlighted passages') was well answered by sosborn.

Related

iTextSharp - when extracting a page it fails to carry over Adobe rectangle highlighting important info

Per the following site...
http://forums.asp.net/t/1630140.aspx?extracting+pdf+pages+using+itextsharp
...I use the function ExtractPages to produce a new PDF based on range of page numbers. My problem is that I noticed a PDF that had a rectangle on the 2nd page was not extracted along with the page. This causes me some fear that perhaps Adobe comments are not being carried over as well as the pages get extracted.
Is there a way I can adjust this code to take into consideration to bring over comments and objects like rectangles to the new PDF when ExtractPages is called? Am I missing a syntax or is that not available with version 5.5.0 of iTextSharp?
Your use of the verb extract in the context of extracting pages is confusing. People will think you want to extract text from a page. In reality, you want to import or copy pages.
The example you refer to uses PdfWriter. That's wrong: you should use PdfStamper (if only one existing PDF is involved) or PdfCopy (if multiple existing PDFs are involved). See my answer to the question How to keep original rotate page in itextSharp (dll) to find out why the example on forums.asp.net is a really, really bad example.
The fact that a page has "a rectangle" is irrelevant. Maybe the rectangle is an annotation. In that case, you're throwing that rectangle away by using the wrong example. Maybe the origin of the page is different from 0,0...
If your purpose is to create a new PDF containing only a selection of pages of the original PDF, please read my answer to Function that I can use to remove a single page from a PDF using iText

How to troubleshoot badly rendered PDF file

I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advices to do this and to do that, which helps to some and does not to others (nothing helped me). Looks like shooting in the dark to me.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World ?
EDIT
A bit of the context. I am trying to convert a trivial HTML to PDF using the wkhtmltopdf tool. It is an absolute frustration, because according to the Internet searches the tool is supposed to work and do it quite well. But the thing does not work for me and nothing I do changes this! Unfortunately, this tool seems the only free tool to convert HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Acrobat and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this; Acrobat is very, very smart in the way it handles PDF files and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. It's goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now and to ensure that it tests a whole range of properties of the file itself and the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct for example) but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google docs - I really have no clue what that might do) that went berserk apparently.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes and with the PDF specification (made available by Adobe here) at hand in a text viewer (for a first impression) and (if the cause of the issue at hand is not immediately visible) then in a PDF browsing tool (for in-depth analysis).
This step is quite cumbersome at first but after some time you learn your way around in the PDFs.
A sample for such a PDF browser tool is RUPS but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed does only contain vector drawing operations which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is 1.570 Bytes. So this task looks not as being overly huge.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3.875 Bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines contain 12 different groups of drawing commands, where each one represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator to actually fill black color into so far constructed (closed) path. Nothing in the other lines (which I didn't analyze in detail) does any drawing that may resemble the shapes of any glyphs.
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1.
Assuming your "free" is meant as free as in liberty, then an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job in converting one or multiple HTML documents into a single PDF book containing chapters, page-numbering, page headers and footers and more. For all options available, see its full documentation.
2.
Assuming your "free" is meant as free as in beer, then an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job when it comes to support almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.

Get text from a pdf in NSString

I am trying to make an iOS app which would extract plain text from a pdf file and display it in a UITextView. Its simply not a pdf reader to view a pdf file but i would later wish to perform certain operations on that text.
I have already googled a lot but still not able to get an exact solution.
i already tried using https://github.com/zachron/pdfiphone
but the files are using ARMV6 architecture which seems obsolete with xcode 4.5
And if anyone can suggest some exact and non-confusing code using Quartz-2d framework of iOS then it would be great.
Here is An Sample code to Extract text from PDF Hope this Might Help You.
https://github.com/zachron/pdfiphone
This is a library to get the text out of a PDF for the iPhone.
Another Demo is there Which uses OCR technology find the link below
https://github.com/nolanbrown/Tesseract-iPhone-Demo
Also Check this page of the Quartz 2D Programming Guide, it covers everything you need to open and parse a PDF file in iOS. Note that it is not a simple task, since there's no method to extract the full text in one line. You have to work with the data as an input stream, using a CGPDFScanner
Two Other Libraries
https://github.com/KurtCode/PDFKitten/
https://github.com/mobfarm/FastPdfKit
This question comes up all the time. It is VERY hard to extract text from PDF in general. The PDF specification is not designed with text extraction in mind. There are many libraries that try to do the job, essentially by reconstructing the text from the geometric placement of the individual glyphs. These libraries have varying degrees of success, but will all fail on certain PDF documents. In fact, some PDF documents have Glyphs but no way to associate the glyph with a character. For these documents it is simply not possible to extract text, short of using some kind of OCR approach.
PDF is designed as a read-only format that is portable in the sense that a PDF document will be rendered identically on any platform. That is what it is best at, and what it should be used for.
If text is to be edited, do not use PDF.
Here (Extracting text from pdf using objective-c), I found an answer to your question and it works. But not so fine as i need it :(
it can extract only ascii
it return me only one paragraph
Good luck.

Extract font and its corresponding cmap in PDF

I am tried several ways to extract font from pdf viz. fontforge, mupdf, pdfparser in C# and also some pythone script. But am just confusing about get exact pair of a font and its cmap embeded in pdf. Please direct me the right approach by which i will get exact pairs of fonts and its cmaps.
As mentioned in my first comment, that should be easy using iText or iTextSharp or any other such library that allows you to access low-level PDF objects.
In case of iText(Sharp), ListUsedFonts.java and ListUsedFonts.cs can present starting points for you; they inspect all the font dictionaries in a PDF file accessible via at least one page. Instead of the simple output of those examples, simply export all the information you need. For this, ISO 32000-1:2008 should be your reference guide.

Reading text + graphic (like lines) info from an existing pdf

I want to read an existing pdf & extract the text and graphics information. Within graphics, currently i just need the drawn lines. There are many vendor component for reading PDF text, but are there ones that can give graphics info too ? Though free/open-source is preferred, I'm ok to commercial ones too.
The requirement is:
For every page in PDF:
Reading text blocks
Getting to know the canvas co-ordinate of the text block (rectangle containing the block). Note, for text with higher font size, the rect size will change.
Lines - need collection of (x1,y1,x2,y2) for every line in a page in pdf
Thanks,
- Seeker
This is my field, though the question is a bit old. Hopefully this still helps.
You leave some room for assumptions, so here are mine:
you seek a script, rather than stand-alone software
your object is archival
you are running command-line scripts:
Use this command line script, detailed at: http://stefaanlippens.net/extract-images-from-pdf-documents
you are running server-side code using imagemagick or graphicsmagick functions:
Something like "convert -background white -flatten test1.pdf test1.jpg" (imagemagick) will render the whole PDF page into a jpeg. If you want to then crop it to the image(s), then it depends upon the context of the project to determine the best script(s) to do that.
A rather complex question. If you wish to provide more details about the project, then I can provide some more guidance. Best of luck.