Recommendations for how to annotate PDFs while working within Emacs? - pdf

Many scientific papers, especially in the life sciences, are published in PDF format.
I want to work as much as possible within Emacs (org-mode especially). I am aware of DocView mode, which at least lets me view PDFs within Emacs. I suspect it can do more for me, but I haven't gotten beyond simply viewing the image-based rendering of a PDF file.
Can anyone recommend ways of working with PDFs, most especially linking to files, extracting text, and adding annotations to PDFs (the electronic equivalent of writing in the margins)?
Edit: Just to clarify, I am not looking to actually edit the PDF image. Rather, I want hyperlinked or bookmarked annotations in an org file. I hadn't seen the text mode of DocView before; that might give me what I want, but I don't know if I can bookmark/hyperlink to it.

pdf-tools, among other things, lets you annotate PDF files in Emacs. Young but promising project!
https://github.com/politza/pdf-tools

IMO, there's no one optimal workflow for managing publications in emacs. I personally simply store links to PDFs in org mode and have them open in the external viewer (evince or Acrobat, depending on the platform). There are solutions to annotate PDFs by literally writing in the margins of the PDF (in principle, Xournal, Jarnal, and some proprietary Windows software can do it), but I never found any of them very usable. When I take notes on the papers, I either store them as folded items within the org-mode structure, or as links to external files.
Other people have come up with similar workflows -- see for instance, a nice screencast here: http://tincman.wordpress.com/2011/01/04/research-paper-management-with-emacs-org-mode-and-reftex/
For me, an ideal paper-management environment would be org-mode interfaced to Mendeley. Unfortunately, the closed-source nature of Mendeley makes this rather improbable.
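The "store a link plus folded notes" workflow above can be sketched in a few lines. This is only an illustration, written in Python rather than Emacs Lisp; the function name `org_entry_for_pdf`, the heading layout, and the example path are assumptions of mine, not part of any tool mentioned here. The only thing it relies on is Org's standard `[[file:PATH][DESCRIPTION]]` link syntax.

```python
from pathlib import PurePosixPath


def org_entry_for_pdf(pdf_path: str, title: str, notes: str = "") -> str:
    """Return an Org heading that links to a PDF and carries free-form notes.

    Hypothetical helper for illustration; the layout is an assumption.
    """
    path = PurePosixPath(pdf_path)
    lines = [
        f"* {title}",
        # Standard Org file link: [[file:PATH][DESCRIPTION]]
        f"  [[file:{path}][{path.name}]]",
    ]
    # The notes become the body of the heading, which Org can fold away.
    lines.extend(f"  {line}" for line in notes.splitlines())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example path and title are made up.
    print(org_entry_for_pdf("papers/example.pdf", "Example paper", "Check figure 3."))
```

Following the link from Org then opens the PDF in whatever viewer Emacs is configured to use for that file type.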

DocView mode can toggle between editing and viewing. But according to the info pages of doc-view-mode, PDF is not (readily) human-editable, and the docs don't mention any PDF annotation capabilities.
Otherwise, Xournal or similar tools are the way to annotate PDFs, unless you find a way to get annotation working under Emacs.

This is not really an answer, but in pdf-tools it's possible to attach handler functions to selected annotations. It's just that someone has to implement it.
;; Toy example for custom annotations.
(require 'pdf-annot)

;; Register the handlers defined below.
(add-hook 'pdf-annot-activate-handler-functions #'pdf-org-annotation-handler)
(add-hook 'pdf-annot-print-annotation-functions #'pdf-org-print-annotation)
(setq pdf-annot-activate-created-annotations t)

(defvar pdf-org-annot-label-guid "www.orgmode.org/annotation"
  "Unique annotation label used for org-annot annotations.")

(defun pdf-org-add-annotation (pos)
  "Add an orange text annotation, labelled as an org annotation, at POS."
  (interactive
   (list (pdf-util-read-image-position "Click ...")))
  (pdf-util-assert-pdf-buffer)
  (pdf-annot-add-text-annotation
   pos
   "Circle"
   `((color . "orange")
     (label . ,pdf-org-annot-label-guid))))

(defun pdf-org-annotation-handler (a)
  "Handle activation of annotation A if it carries the org label."
  (when (equal (pdf-annot-get a 'label)
               pdf-org-annot-label-guid)
    (pop-to-buffer (get-buffer-create
                    (format "%s-annotations.org"
                            (pdf-annot-get-buffer a))))
    ;; Do SOMETHING.
    ;; Returning non-nil marks the annotation as handled.
    t))

(defun pdf-org-print-annotation (a)
  "Return a description for annotation A if it carries the org label."
  (when (equal (pdf-annot-get a 'label)
               pdf-org-annot-label-guid)
    "Org annotation, click to do SOMETHING"))

(provide 'pdf-org)

Related

How to troubleshoot badly rendered PDF file

I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advice to do this or that, which helps some people and not others (nothing helped me). It looks like shooting in the dark to me.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World ?
EDIT
A bit of the context. I am trying to convert a trivial HTML to PDF using the wkhtmltopdf tool. It is an absolute frustration, because according to the Internet searches the tool is supposed to work and do it quite well. But the thing does not work for me and nothing I do changes this! Unfortunately, this tool seems the only free tool to convert HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Acrobat and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this; Acrobat is very, very smart in the way it handles PDF files and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. Its goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now, and to ensure that, it tests a whole range of properties of the file itself and the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct, for example), but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google docs - I really have no clue what that might do) that went berserk apparently.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes and with the PDF specification (made available by Adobe here) at hand in a text viewer (for a first impression) and (if the cause of the issue at hand is not immediately visible) then in a PDF browsing tool (for in-depth analysis).
This step is quite cumbersome at first but after some time you learn your way around in the PDFs.
An example of such a PDF browser tool is RUPS, but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed does only contain vector drawing operations which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is only 1,570 bytes, so this task does not look overly large.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3,875 bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines contain 12 different groups of drawing commands, each of which represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator; it actually fills the (closed) path constructed so far with black. Nothing in the other lines (which I didn't analyze in detail) does any drawing that may resemble the shapes of any glyphs.
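To check this kind of thing without reading every line by eye, the path operators can be tallied with a small script. The following is a rough Python sketch, not a real content-stream parser: it only understands m, l and h, ignores every other operator, and assumes the stream has already been uncompressed (for example with the qpdf command above). The function names are my own.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def parse_subpaths(content: str) -> List[List[Point]]:
    """Collect subpaths built from m (moveto), l (lineto) and h (closepath).

    Simplification: all other operators are skipped, so this only suits
    streams made of plain straight-line paths like the one shown above.
    """
    subpaths: List[List[Point]] = []
    current: List[Point] = []
    operands: List[float] = []
    for token in content.split():
        try:
            operands.append(float(token))
            continue  # numbers are operands for the next operator
        except ValueError:
            pass
        if token == "m" and len(operands) >= 2:
            if current:
                subpaths.append(current)
            current = [(operands[-2], operands[-1])]
        elif token == "l" and len(operands) >= 2:
            current.append((operands[-2], operands[-1]))
        elif token == "h" and current:
            subpaths.append(current)
            current = []
        operands = []  # operands are consumed by each operator
    if current:
        subpaths.append(current)
    return subpaths


def bounding_box(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for one subpath."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

Fed the six lines quoted above, `parse_subpaths` yields one subpath of five points whose bounding box is roughly 166.4 units on each side, i.e. a plain square rather than anything glyph-shaped.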
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1. Assuming your "free" is meant as free as in liberty, then an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job of converting one or multiple HTML documents into a single PDF book containing chapters, page numbering, page headers and footers, and more. For all options available, see its full documentation.
2. Assuming your "free" is meant as free as in beer, then an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job of supporting almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.

Accessibility concerns for website providing massive amounts of PDFs

I am working on a website providing a massive number of PDFs for download, and I am trying to improve the website's accessibility. All I can think of is:
1. Provide equivalent content for the PDFs when possible (text or HTML, for example).
2. Provide a description of the PDF documents before the user can download them.
3. Make it possible to search within the PDF files when users use the website search.
4. Make the links to the PDFs labelled by a nice icon.
5. Inform the users that they will need a third-party application (Acrobat or other PDF viewers) in order to open the documents.
Are there other ways to improve it?
Like Jared said, assistive technology works decently with PDFs. The question is what kind of quality control you have. There are a few different ways of putting together a PDF. One way is scanning a document, and the result is a PDF made out of images. When assistive technology hits it, all it says is image image image; great help, right?
Now, Adobe built in an Optical Character Recognition ability (the second way), which has improved over the years but is still far from reliable. For example, I was given a PDF that had OCR on it. One of the first lines had the word Articles, in italics; the OCR spit out Art/e5. The third way is to produce PDFs containing actual text. Office 2007/2010 have the ability to save as a PDF. Before hitting save, click the Options button and ensure the "document tags for accessibility" box is checked.
PDFs have a tag structure, like HTML, found via the Tags panel/pane. The output in 2010 is a bit cleaner than in 2007, but I still recommend something like CommonLook Office to create your PDFs.
4. Make the links to the PDFs labelled by a nice icon.
You could put an icon within the link. Some people do:
Link text <img src=".." alt="PDF icon"/>
Some people using assistive tech just browse via links, so they won't know it is a PDF before they open it. So, it is better to do:
Link text <img src="" alt="PDF"/>
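If there are a lot of pages to check, this can be audited automatically. Below is a rough Python sketch using only the standard library's `html.parser`. It assumes the link text and the icon both sit inside the `<a>` element, and it flags links to `.pdf` files whose text plus image alt text never mentions "PDF". The class name `PdfLinkAudit` and the sample file names are made up for the example.

```python
from html.parser import HTMLParser
from typing import List, Optional


class PdfLinkAudit(HTMLParser):
    """Collect hrefs of PDF links whose accessible text does not say "PDF"."""

    def __init__(self) -> None:
        super().__init__()
        self.unlabelled: List[str] = []
        self._href: Optional[str] = None  # set while inside a PDF link
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and (attr.get("href") or "").lower().endswith(".pdf"):
            self._href = attr["href"]
            self._text = []
        elif tag == "img" and self._href is not None:
            # Alt text of an image inside the link is read out as link text.
            self._text.append(attr.get("alt") or "")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "pdf" not in " ".join(self._text).lower():
                self.unlabelled.append(self._href)
            self._href = None


if __name__ == "__main__":
    audit = PdfLinkAudit()
    audit.feed('<a href="report.pdf">Annual report <img src="pdf.png" alt="PDF"/></a>')
    audit.feed('<a href="minutes.pdf">Minutes</a> <img src="pdf.png" alt="PDF"/>')
    print(audit.unlabelled)
```

The second link in the usage example gets flagged because its icon sits outside the link, so someone tabbing through links would only hear "Minutes".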
5. Inform the users that they will need a third-party application (Acrobat or other PDF viewers) in order to open the documents.
It is a good idea to do this; in fact, Section 508 requirements say to do this. I recommend linking to Adobe Reader for two reasons.
1. If the person does not have a PDF viewer, they'll probably call their "computer expert", who has probably heard of Adobe Reader and knows the site isn't pushing some adware.
2. Adobe Reader has the most built-in accessibility of the readers out there, to my knowledge. So why would you not give the best?
There are several things you can do to improve the accessibility of the PDFs themselves.
Provide "Alternate Descriptions" for images
Provide "Replacement Text" for items such as equations or abbreviations
Replacement Text can also be used to hint at the pronunciation of names
Mark the language, especially if it is mixed
This will assist a screen reader in properly understanding the PDF. This isn't crucial for pages that contain only text in regular paragraph layout - the reader can usually figure things out. If there are pictures, captions, jargon, names, etc, this will greatly improve the reader's performance.

How to create switchable multi-language pdf form?

I want to create a PDF form with a two-language (Chinese/English) UI, with a button or something similar on the form to switch languages. Is there any way to do this, and if so, how?
thanks!
Thanks for all the replies!
Actually, I got a sample like this:
PDF Sample
There are two checkboxes at the top left of the form: one is for the English UI, the other for Chinese. I just want to know how to make a PDF like that sample (and I don't see any layers in the sample...).
thx
mkl's comment (which he should turn into a full answer, really) already hinted at the option to use different page templates residing in the same file.
Another option you could explore is this:
put the two language versions into 2 different layers (or 'optional content groups' in PDF parlance)
make the visibility of the two layers toggleable
let the user activate that layer which he/she needs.
Layer activation can be handled through normal Acrobat Reader user interface elements.
The layer switching can be made accessible via a "button" on the PDF page too -- but that requires additional JavaScript to be embedded in the PDF (something many people are not particularly keen about).
As Kurt proposed, I make my comment on Frank's answer an answer in its own right:
Actually, there is a PDF feature seldom used nowadays: page
templates. Thus, those two forms can reside in the same file in
different page templates, and based on some initially present buttons
("English version", ...) the desired form is spawned.
Unfortunately, I don't know how to create page templates using some easy-to-use tool; I only came across them in the context of integrated PDF signatures (depending on the signature type, page template instantiation is a document change that does not break the signature) and tested them with low-level tools.
Essentially, page templates are PDF objects just like the page dictionaries of normal pages; they are not XFA stuff. They merely are not referenced in the pages tree but in the name tree instead.
There is a JavaScript command which creates a visible page based on such a template; I don't remember which one anymore, but I may be able to find out when I'm back in the office next week. This command would have to be bound to the initial language selection button in the file.
The problem will be in switching the static text - PDF does not allow this.
If I were you, I would split the document into two identical forms in the respective languages. You can use bookmarks and links on the first page to navigate to the right part of the document.
Note that it is possible to assign the same field names to the English/Chinese versions of your fields. This will make it easier to process the submitted form data because the processing path would be independent of the chosen language. It will also simplify any JavaScript (validation, summing, etc.) you plan to add.
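To illustrate why shared field names help on the receiving end, here is a small Python sketch of a submission handler. Everything in it is hypothetical (the field names `name`, `quantity`, `unit_price` and the label table are invented for the example); the point is only that the processing function never needs to know which language version the user filled in.

```python
from typing import Dict

# Display labels differ per language; the field names (the keys) do not.
# All names here are invented for illustration.
LABELS: Dict[str, Dict[str, str]] = {
    "en": {"name": "Name", "quantity": "Quantity", "unit_price": "Unit price"},
    "zh": {"name": "姓名", "quantity": "数量", "unit_price": "单价"},
}


def label_for(field: str, language: str) -> str:
    """Language only matters when text is shown to the user."""
    return LABELS[language][field]


def summarize(fields: Dict[str, str]) -> Dict[str, object]:
    """Validate and total a submission; no language branching is needed."""
    quantity = int(fields["quantity"])
    unit_price = float(fields["unit_price"])
    if quantity < 0 or unit_price < 0:
        raise ValueError("quantity and unit_price must be non-negative")
    return {"name": fields["name"].strip(), "total": quantity * unit_price}


if __name__ == "__main__":
    # The same dictionary shape arrives from either language version.
    print(summarize({"name": " Li Wei ", "quantity": "3", "unit_price": "2.5"}))
```

Had the two versions used different field names, `summarize` would need one code path per language, or a mapping step in front of it.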

pdf see current line ruler

I'm looking for an accessibility tool to make it easier to read PDFs.
In short, it should be possible to easily see which line is being read (a bit like a ruler laid over the text), to avoid losing the line that is being read.
I was wondering if anyone knows of a solution for this, for example a plugin for Adobe Acrobat Reader, etc.
Any suggestions are welcome.
I don't think there is a plug-in for Acrobat Reader. You may want to look at ZoomText or ClaroRead. Of course, these only work if the PDF has actual text rather than images of text.
A low tech solution would be to open a Notepad doc and size it how you need. If you are on Win7 you could do this with sticky notes.
Another approach I've used is to convert the PDF to HTML and then run a server with it. This is fairly simple to accomplish using Live Server in VScode.
In the Chrome browser, we may then use accessibility extensions, such as ReadingBuddies, that have reading ruler functions.
Otherwise, consider:
Use a PDF reader that has a built-in reading ruler feature, such as Adobe Acrobat Reader DC or Foxit Reader.
Use a PDF reader that allows you to add a reading ruler as an annotation, such as Xodo PDF Reader.
Use an online tool that allows you to view PDFs with a reading ruler, such as Smallpdf's PDF Reader.
Use a screen ruler tool, such as the one offered by How-To Geek, to measure the PDF on your screen.
The academic term is sometimes given as RSVP (Rapid Serial Visual Presentation). There are patented hardware and software versions, but in principle it is simply a translucent mask added to the viewport. See https://softwarerecs.stackexchange.com/questions/28582/is-there-an-equivalent-to-a-reading-guide-strip-for-windows-os-x-or-linux and http://www.see-n-read.com/products/esee-n-read-2/
Ten years later, it's 2023, so software such as browsers should include such features. Edge does on some sites where Immersive Reader is supported, but not on Stack Overflow! The above example is using an Edge extension: https://microsoftedge.microsoft.com/addons/detail/screen-mask/dfanfcmhbdocjfpmnoebccndgmhlincl Others are available for other browsers: https://chrome.google.com/webstore/detail/reading-ruler/phiedfcbjfjagnjikfbobmldbpmdcpfk
To get the Reader Mode options in Chrome or Edge, look at the available flags.
However, if you save the page as a PDF and use Read Aloud, it is then used there!
Some PDF readers, such as Skim on the Mac, include such an accessibility option.
However, the simplest approach is:
Most PDF reader windows can be shrunk so that the viewport focuses on a single line; combined with auto-scrolling, that allows more focused "line by line" reading without the audio, plus fast and easy adjustment or enlargement for PDFs with variable line heights and illustrations.
Note that in a PDF like the one above, where much of the text is actually stored one or two lines out of order, it is not trivial for a PDF reader to work out which text baseline should come next. In practice, "Read Aloud" will read two variable-height lines, then jump to the top of the page, then back to the second visible line. PDF lines are neither stored in visible order nor of constant height or spacing, as you might expect.
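As a rough illustration of what a tool has to do to recover the visible line order, here is a Python sketch. It assumes the text fragments and their baseline coordinates have already been extracted by some PDF library (that part is not shown), that the page has a single column, and that fragments whose baselines lie within a small tolerance belong to the same line. The `Fragment` type and the tolerance value are my own assumptions.

```python
from typing import Iterable, List, NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the fragment
    y: float   # baseline in PDF user space (origin at the bottom left)
    text: str


def visual_lines(fragments: Iterable[Fragment], tolerance: float = 2.0) -> List[str]:
    """Regroup fragments into lines, top to bottom and left to right.

    Heuristic only: single-column layout assumed; fragments whose baselines
    differ by at most `tolerance` units are treated as one line.
    """
    lines: List[List[Fragment]] = []
    # Larger y means higher on the page, so sort by descending baseline.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


if __name__ == "__main__":
    # Stream order differs from the order a reader sees on the page.
    stream_order = [
        Fragment(72, 680, "second line"),
        Fragment(72, 700, "first"),
        Fragment(120, 700.5, "line"),
        Fragment(72, 660, "third line"),
    ]
    for line in visual_lines(stream_order):
        print(line)
```

Even this simple version needs a tolerance because baselines on the same visual line are not always identical, and it would break on multi-column pages, which is part of why line-by-line highlighting in PDFs is less trivial than it looks.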

A better file format than PDF or EPUB?

My client wants us to build a custom document viewer for their app. (It really, truly needs to be custom, because there are a ton of application-specific features they need.)
We built one for them last year that took PDFs, generated page images, and backed the images using a hidden layer of text that could be selected and copied. We did it in Flex. It was a nightmare. PDF is horrid.
This year, we need to build one in HTML 5 with similar requirements, except that most of the documents now are in Word or HTML, that is, they have reflowable text, instead of the fixed layout and glyphs of PDF. But they still want to do PDF in the same viewer.
I'm thinking that we need to convert all documents to some common file format that can handle both reflowable text and also the fixed-position glyphs of PDF. (Each document would probably support one or the other, but not both). It would be nice if it were an XML-like markup language that would say:
<text>here's some text</text>
-- or --
<glyph letter="a" name="my_a_glyph" position="10,10"/>
<image src="my_image" position="20,20"/>
or something like that.
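To make the idea concrete, here is a minimal Python sketch of how a viewer might load such a format. The markup is purely the hypothetical one from above; I have added a `<document>` root element (XML needs a single root) and assumed that `position` is always an "x,y" pair. None of this is an existing format.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def parse_position(value: str) -> Tuple[float, float]:
    """Turn a "10,10" style attribute into an (x, y) pair."""
    x, y = value.split(",")
    return float(x), float(y)


def load_items(xml_text: str) -> List[tuple]:
    """Return a flat list of renderable items from the hypothetical markup.

    Reflowable text becomes ("text", content); fixed-position items become
    (kind, reference, (x, y)).
    """
    items: List[tuple] = []
    for node in ET.fromstring(xml_text):
        if node.tag == "text":
            items.append(("text", node.text or ""))
        elif node.tag == "glyph":
            items.append(
                ("glyph", node.get("letter"), parse_position(node.get("position", "0,0")))
            )
        elif node.tag == "image":
            items.append(
                ("image", node.get("src"), parse_position(node.get("position", "0,0")))
            )
    return items


if __name__ == "__main__":
    sample = (
        "<document>"
        "<text>here's some text</text>"
        '<glyph letter="a" name="my_a_glyph" position="10,10"/>'
        '<image src="my_image" position="20,20"/>'
        "</document>"
    )
    for item in load_items(sample):
        print(item)
```

The renderer could then reflow the `text` items and place the `glyph` and `image` items absolutely, which is the split between the two document kinds I have in mind.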
Is there any existing file format out there that can handle it? EPUB won't do the fixed-position text, and PDF sucks in too many ways to describe.
I think you can look at the FB2 (FictionBook 2) format. It is an XML-based format designed for publishing books. It includes images, though I am not sure if they can be positioned absolutely.
Also, you can simply go with HTML and do HTML-to-PDF rendering when needed (there exist various components and libraries for this). I don't see (or you have not listed) any reasons why this approach wouldn't work.
GROFF? Maybe build a macro library to customize it, as needed.
Groff/troff/nroff, the "run off" programs of Unix, can output to PostScript or HTML. The jump from PostScript to PDF is built into some PDF viewers; there are also several existing programs for it, pstopdf, for example.
GROFF has some fixed layout options and some flow-like options. With GROFF, it's almost easier to base most of the printout on flowing text, within prescribed bounds.