How to parse text from a plain text file and use the result to highlight a PDF file - pdf

Back in 2010, some guy claimed to be capable of doing this:
http://www.mobileread.com/forums/showthread.php?t=103847
"The Kindle stores its annotations in a Mobipocket (".mobi") file for each document and in one long text file named "My Clippings.txt." In this post I describe a system that synchronizes these annotations with PDF versions of the corresponding documents on a computer.
Overview
This system is embodied in an Applescript that parses the My Clippings file and controls the Skim PDF reader. The script first parses the clippings file. It then searches through the clippings and isolates any that come from documents on the kindle matching the filename of the currently open PDF file (the "pertinent clippings"). The script then iterates through each of the pertinent clippings, locating the matching text or location in the PDF document and applying highlights or adding notes where appropriate. The end result is an annotated, printable PDF document that matches the document on the kindle.
You can download the script here: http://dl.dropbox.com/u/2541109/KindleClippings.scpt. Before running the script, be sure to change the value of MyEmail to match your sending address and to verify that the Kindle mount point defined in MyClippingsFile is correct. You'll also need the free Skim PDF Reader.
To use it, send or copy a document file to your kindle. Remember, the kindle supports RTF, DOC, TXT and other common text formats and it will convert them into MobiPocket files internally for easier reading. Make some notes. Then take the same document that you just sent to the kindle and convert it to a PDF, e.g. by using the print to PDF feature in Mac OS X. Be sure to keep the filename the same. Open that same PDF in Skim and run the script. The highlights and notes should appear in the PDF.
If you're interested in how this works, read more on my blog here:
[no longer available]
Sadly, his script is no longer available, nor his blog.
Do you guys know if this is possible? I've been looking for this kind of functionality but can't find it anywhere.

This code, using Python and PyMuPDF, works:
import fitz  # PyMuPDF

# the document to annotate
doc = fitz.open("text_to_highlight.pdf")

# the text to be marked
text_list = [
    "first piece of text",
    "second piece of text",
    "third piece of text",
]

for page in doc:
    for text in text_list:
        # one quad per matching line fragment on this page
        rl = page.search_for(text, quads=True)
        if rl:  # skip pages where the text does not occur
            page.add_highlight_annot(rl)

# save to a new PDF
doc.save("text_annotated.pdf")
The original 'My Clippings.txt' has to be preprocessed somehow. stringr could work, but I found it more useful to edit the text with multiple selections in Sublime Text. The goal is a list of highlights in the form of text_list above.
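As a rough sketch, the same preprocessing could also be done in Python. The function name `parse_highlights` is hypothetical, and the entry layout is an assumption: a title line, a metadata line containing "Your Highlight", the highlighted text, and a "==========" separator (the separator and the "- Your Note"/"- Your Bookmark" markers appear in another answer in this thread; "Your Highlight" is assumed by analogy).

```python
def parse_highlights(clippings_text, book_title):
    """Return the highlighted passages for one book (assumed entry layout)."""
    highlights = []
    for entry in clippings_text.split("=========="):
        # drop blank lines and surrounding whitespace
        lines = [ln.strip() for ln in entry.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 3:
            continue  # e.g. bookmarks, which carry no text body
        title, meta, body = lines[0], lines[1], " ".join(lines[2:])
        if book_title in title and "Your Highlight" in meta:
            highlights.append(body)
    return highlights


# Made-up sample data in the assumed layout
sample = (
    "Book Title (Some Author)\n"
    "- Your Highlight on page 15 | Added on Monday, 1 January 2015\n"
    "\n"
    "mammals are animals\n"
    "==========\n"
    "Book Title (Some Author)\n"
    "- Your Note on page 15 | Added on Monday, 1 January 2015\n"
    "\n"
    "my comment\n"
    "==========\n"
    "Other Book (Someone Else)\n"
    "- Your Highlight on page 3 | Added on Monday, 1 January 2015\n"
    "\n"
    "unrelated text\n"
    "==========\n"
)
text_list = parse_highlights(sample, "Book Title")
```

The resulting `text_list` can be used in place of the hand-written list in the PyMuPDF script above.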

I am trying to do this using Python + a Windows macro creator (I'm a Win 7 user). You can use this approach to save the file as RTF, DOCX, PDF, etc. So far, it's been reasonably effective. Do note 2 things first:
1- The 'My Clippings' file only saves the text and the page; it does not save the location on the page. For example, if you highlighted "mammals are animals" on page 15, it gives you this line and the page number, but if "mammals are animals" occurs more than once on page 15, it is impossible to know which occurrence you highlighted. This is especially bad when you've highlighted a generic word, like "animals" or "the". And if you made a comment by pressing on a word, that word is the only information you get about what the comment refers to on that page (e.g., I pressed on "animals", the menu popped up, and I selected 'Comment'. If "animals" appears 20 times on page 15, I cannot know which of them my comment refers to).
2- The only way to retrieve the location on the page would be to analyze the *.pds and *.pdt files, inside the *.sdr folder in Kindle's drive ('Documents'). I can make no sense of these files.
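The ambiguity described in note 1 cannot be resolved from the clippings alone, but it can at least be detected. The sketch below assumes you already have the text of each PDF page in a dictionary (how you extract it is up to you); `find_ambiguous` is a hypothetical helper that reports every clipping that does not occur exactly once on its page.

```python
def find_ambiguous(page_texts, clippings):
    """page_texts: {page_number: text}; clippings: [(page_number, snippet)].

    Returns (page_number, snippet, hits) for every snippet that is missing
    from its page or occurs there more than once.
    """
    ambiguous = []
    for page_number, snippet in clippings:
        hits = page_texts.get(page_number, "").count(snippet)
        if hits != 1:
            ambiguous.append((page_number, snippet, hits))
    return ambiguous


page_texts = {15: "mammals are animals. Some animals fly."}
clippings = [(15, "mammals are animals"), (15, "animals")]
report = find_ambiguous(page_texts, clippings)
```

Clippings listed in the report are the ones worth checking by hand after the automatic pass.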
In Python, you can run a short script to extract the information you want from "My Clippings". Then you can use a macro creator to automate copying the text, annotating the PDF with it (using Adobe Acrobat, for example), and saving the PDF file.
An example with Adobe Acrobat:
Say I want to save all my highlights to the PDF file. First, I run a Python script that copies all the strings related to the highlights (i.e., the highlighted text and the page number) to a new *.txt file. Here's an example of such code (but first, copy the "My Clippings.txt" file to the IDE start folder, e.g.: C:\Python27):
#for python 2.7.6
with open('My Clippings.txt','r') as rf:
    with open('My Clippings Output.txt','w') as wf:
        access = 0
        bookTitle = 'Book Title' #put the book file's name as it's written in "My Clippings.txt"
        for x in rf:
            if access == 1:
                wf.write(x)
            if bookTitle in x:
                access = 1
            #for highlights only, instead of all annotations, include this if statement:
            if (' | Added on ' in x) and (('- Your Note ' in x) or ('- Your Bookmark ' in x)):
                access = 0
            if x == '==========\n':
                access = 0
Then I create a macro that copies the page number from the "My Clippings Output.txt" file (it's in the same folder as "My Clippings.txt"), pastes it into Acrobat's page number box, finds (Ctrl+F) the string on the page, then presses "highlight". Done!
There's a catch in Acrobat, though: the search/find function has a limit of ~28 chars, so the text you search for can't be longer than that. I still don't know how to circumvent this limitation. I raised this problem here: https://superuser.com/questions/884221/how-to-search-and-highlight-long-passages-in-a-pdf-file . As a workaround for the 28-char limit, you can program the macro to select the text with "Shift" + "Right Arrow" pressed 28 times, and then use "cut" instead of "copy".
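If the output file is produced by a Python script anyway, the shortening can also happen there instead of in the macro. This is only a sketch: `search_key` is a hypothetical helper, and the default of 28 is the approximate limit reported above, not a documented Acrobat constant. It keeps whole words so the search string does not end mid-word.

```python
def search_key(passage, limit=28):
    """Return a prefix of `passage` that fits in `limit` characters,
    cut at a word boundary where possible."""
    passage = " ".join(passage.split())  # collapse runs of whitespace
    if len(passage) <= limit:
        return passage
    cut = passage[:limit]
    # if the cut falls inside a word, back up to the previous space
    if passage[limit] != " " and " " in cut:
        cut = cut[:cut.rindex(" ")]
    return cut.rstrip()


key = search_key("mammals are animals that nurse their young")
```

A shorter key matches more places on the page, so the ambiguity problem from note 1 gets worse the more the passage is truncated.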
There are many free and libre macro creators out there; just google and choose the one you like best. For Windows, my favorite is Pulover's Macro Creator. If you have any questions about the process you can comment here or PM me. I'd prefer that you comment here, so that I can improve the answer.

Related

Section Header Range.Text Returning Empty String Instead of Actual Text

I have a PDF file that I am trying to parse text out of. I opened the file using Microsoft Word, and text I need is in the header. On the first page, the header is justified left with a center tab that has the text (plain English name document title instead of the complicated reference name) that I am trying to grab. There is a right tab that has a page number control that I don't care about.
When I try to run the following:
Debug.Print ThisDocument.Sections(1).Headers(wdHeaderFooterPrimary).Exists
it gives me True, so I know the header exists. However, when I try to run
Debug.Print ThisDocument.Sections(1).Headers(wdHeaderFooterPrimary).Range.Text
it gives me nothing but an empty string, which I can further confirm by wrapping it in a Len(…) command which gives me 1. How can I get the text out of the header?
Of note, I tried using some Adobe SDK functions which would have been easier, but I do not have the professional Acrobat suite so I do not have access to those tools. Hence the MS Word workaround.

How to create hyperlink from a pdf to another pdf to a specified page using itext

I am using iText to create a PDF. As the final result I download a zip file. After extracting it, I have the following directory structure:
main dir
|
|_ evidence_dir/abc.pdf
|
|_xyz.pdf
I am using this code to create the link in the PDF:
chunk = new Chunk( "Link" ).setAction(PdfAction.gotoRemotePage("evidence_dir/abc.pdf", "6", false, true ));
This code is for the file xyz.pdf. The link is created, but when I click on it the current PDF closes and then nothing happens.
Can anybody please help me?
Thanks,
Manish
I've created a small standalone example that shows how to create a RemoteGoto in a PDF using iText. You can download the ZIP with the resulting PDFs here. It works for me; can you check if it works for you?
Several things aren't clear from your question.
Is "6" present as a named destination in your abc.pdf? (I created an abc.pdf file with a destination named "dest")
Is "6" a named destination defined by a PDF string? (cf. your false parameter)
Are you aware of the limitations of opening a new PDF viewer window? (cf. your true parameter)
Update:
In your comment, you say that "6" should be a number, but in your code, you use a string. It's normal that that doesn't work, strings aren't numbers. Please take a look at the RemoteGoToPage example to see how it's done.
Update 2
In one of the comments, I'm asked if you can link to a specific word in an existing PDF from an HTML-link. That's a completely different question. You can do this using Open Parameters. On page 7 of this spec, you can find more info about the search parameter:
Opens the Search UI and performs a search for the specified word list
in the document. Matching words are highlighted in the document.
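As a minimal illustration of what such a link looks like, the sketch below builds the URL in Python. `pdf_link` is a hypothetical helper; `page` and `search` are parameter names from Adobe's PDF Open Parameters document, but how a particular viewer or browser handles the quotation marks and spaces is not guaranteed, so treat the output as something to test in your own setup.

```python
def pdf_link(url, page=None, search_words=None):
    """Append PDF open parameters to a URL (sketch, viewer support varies)."""
    params = []
    if page is not None:
        params.append("page=%d" % page)
    if search_words:
        # the word list is quoted, with words separated by spaces
        params.append('search="%s"' % " ".join(search_words))
    return url + ("#" + "&".join(params) if params else "")


page_link = pdf_link("abc.pdf", page=6)
word_link = pdf_link("abc.pdf", search_words=["word1", "word2"])
```

The resulting string would go into the href attribute of the HTML link.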

How to automate extracting pages from a PDF using AppleScript and Acrobat Pro?

I'm new to AppleScript, but I am trying to create a script that will go through all PDFs in a folder extracting the pages into separate files. My plan is to use a combination of Automator and AppleScript.
My AppleScript so far is:
tell application "Adobe Acrobat Pro"
    open theFile
    set numPages to (count active doc each page)
    --execute the extraction here
end tell
The command in Acrobat Pro is under Options > Extract Pages..., where I can specify the page range and to extract to separate files. However, I can't seem to find a way to do this with the Acrobat Pro Dictionary in AppleScript.
There is an execute command that executes a menu item, but I can't seem to get it working (I'm also not sure of the syntax to use; i.e. execute "Options:Extract Pages..."?). Any help on this?
I think you can do this entirely with the Automator without the need for AppleScript or Adobe software. The "PDF to Images" action splits a multi-page .PDF file into individual .PDF files, one per page:
You can use Adobe Acrobat Pro. Here is an example using Adobe Acrobat Pro XI. It uses Acrobat's "Actions" (previously called "Batch Processing") with custom JavaScript.
Adobe Acrobat Pro - Edit Action
You can create a new action that prompts the user to select a folder of PDF files to process. Then you can add a JavaScript step that reads each PDF's file name and calls the extractPages function to extract all the pages from that PDF.
Adobe Acrobat Pro - JavaScript
The following will extract all of the pages into separate PDFs. It appends a suffix with the sheet number to each file name. The sheet number is zero-padded using the method described in the links below: prepend a run of zeros to the number, then slice off only the last few characters of the resulting string, as many as you need for the number of sheets you typically have.
/* Extract Pages to Folder */
var re = /.*\/|\.pdf$/ig;
var filename = this.path.replace(re, "");
for (var i = 0; i < this.numPages; i++) {
    this.extractPages({
        nStart: i,
        nEnd: i,
        // zero-pad the 1-based sheet number to three digits
        cPath: filename + "_s" + ("000000" + (i + 1)).slice(-3) + ".pdf"
    });
}
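For readers who want to check the padding trick outside Acrobat, here is the same idea as a small Python sketch. `page_filename` is a hypothetical helper name; the "_s" suffix and the three-digit width mirror the script above.

```python
def page_filename(base, index, width=3):
    """Build the per-page file name; index is 0-based like the loop variable i."""
    # prepend zeros, then keep only the last `width` characters
    padded = ("000000" + str(index + 1))[-width:]
    return base + "_s" + padded + ".pdf"


names = [page_filename("report", i) for i in (0, 11)]
```

Note that slicing silently drops leading digits once the sheet number has more digits than the chosen width, so the width should be picked for the largest document you expect.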
References
JavaScript for Acrobat API Reference > JavaScript API > Doc > Doc methods > extractPages
Extract pages to separate pdf's (something wrong with loop?)
How can I create a Zerofilled value using JavaScript?
How to output integers with leading zeros in JavaScript [duplicate]

How to preserve formatting from rstudio when copy/pasting to Word?

I want to reproduce my code in Word 2010. The scripts were written in rstudio, and I would like to preserve rstudio's formatting when pasting into Word. Principally, I like the font colors and spacing that rstudio uses. I find that when I paste from SAS to Word, the formatting is preserved, but no dice here.
I would usually look for copy special / paste special options to do this, but I can't find any. When I try to paste special into word, only unformatted text options are presented. I would rather not reformat the text line-by-line, because I think it looks pretty nice in rstudio.
I thought of trying to save the script in rstudio to some format that would preserve its formatting, but I couldn't find any way to do this. How can it be done?
It's not totally clear whether you are pasting from RStudio's script editor (which has some 4 or 5 colors) or from the R console (script + output) within RStudio (which only has 2 colors).
If you are pasting from the console--please check "Paste special" again. There should be an option for "HTML Format" that will do what you need (though you may need to resize the font to make everything fit properly depending on your page margins).
If you are pasting from the script editor, then you're out of luck with a direct copy-and-paste solution. But there is a copy-and-paste-and-copy-and-paste solution...
One solution could be to use Notepad++. From RStudio, save your script (with a ".R" extension) then open the script in Notepad++. (Or copy and paste from RStudio to Notepad++, but make sure you set the file's language--from the "Language" menu--to R). When your script is correctly highlighted in Notepad++ go to the "Plugins > NppExport > Copy HTML to clipboard" menu to copy the open file. This can then be pasted into MS Word with HTML format.
Just in case someone else looks for this question...
Another way to have all the source code in a word document with a good-looking format using RStudio is to use the File/Compile Notebook option, choosing MS Word as the output format.
Using this option, a .docx document will be generated with the output of your script as well as the original source code. The script will be executed, though.
If you don't want your code to be evaluated (you just want a simple copy-paste), you can add #+eval=FALSE at the beginning of your script and then the source code will be reproduced in the word document without being evaluated.
This approach relies on knitr. Here is an example if anyone wants to start playing with this.
#' ---
#' title: "My homework"
#' author: John Doe
#' date: June 15, 2015
#' output: word_document
#' ---
# The header above sets some metadata used in the knitr output
# Conventional comments are formatted as regular comments
# Comments starting with "#+" control different knitr options.
#+echo=FALSE,message=FALSE,warning=FALSE
library(ggplot2)
#+echo=TRUE
#' Comments with a "+" sign are used to tell knitr what should be
#' done with the chunk of code:
#'
#' - echo: Show the original code or not
#' - eval: Run the original code or not
#' - message: Print messages
#' - warning: Print warnings
#' - error: Print errors
#' ...
#' Comments with an apostrophe "'" will be printed as regular text.
#' This is very useful to explain what you are actually doing!
# Regular comments can be used to document the code as usual
# Figures are printed:
ggplot(mpg, aes(x=cty, y=hwy)) + geom_point(aes(color=class))
#' Formatting **options** are possible.
#' Even [links](http://stackoverflow.com/questions/10128702/how-to-preserve-formatting-from-rstudio-when-copy-pasting-to-word)
#'
#' This will show all the packages and versions used to generate this document.
#' It can be used to make sure that your teacher has all he needs to run your script
#' if he/she wants to.
sessionInfo()
Assuming you have internet access
Copy and paste to gist.github.com
Select 'R' as the language - this should provide colours
Hit create (secret or public) gist
Copy and paste from the gist to your word processor.
Compared with the notepad++ solution:
An online backup to your code, with a recording of the time when you clipped it.
You don't have to install any other software, useful if you're a student using a public computer.
If you just need the code as formatted:
Step 1: Just add #+eval=FALSE at the beginning of your code.
Step 2: Then go to File -> Knit Document. Compile the file to MS Word/PDF/HTML.
OR
Just add #+eval=FALSE at the beginning of your code.
Press CTRL+SHIFT+K and then compile the file to MS Word/PDF/HTML.
If you need the code with output, do not add #+eval=FALSE at the beginning of your code and perform step 2 directly.
I agree with zeehio that using Knitr is probably the best option. But another way is to use the Pretty R tool and the "open document text" steps here. Basically just copy and paste your code into pretty R, and copy and paste the output (not the html) into the open document.
After you copy from the RStudio Console window and paste into a Word document, you need to select all the just-pasted text and change the font to Courier New. This will give you the same spacing and alignment as you had in the RStudio Console window.
Copy and paste the code from the RStudio editor into Visual Studio Code, and then copy from there into a word processor.
For this to work you must first install the R extension in Visual Studio Code.
Visual Studio Code is itself an IDE that can be used for R as well, but here I'm only using it to answer the above question.
In R I use the Monaco editor font. To copy the output of the R console into Microsoft Word, I select the output in the console, right-click, copy, and paste it into my Word document. Once I have pasted the output into Word, I select it, set it to the Monaco font, and reduce the font size if necessary.
This does the job very nicely and perfectly preserves the output style from the R console, as well as written chunks of code.
If you want to retain the formatting when copying a selection from the R Console, you will need to install an older version of RStudio, version 1.2.5042. It will not work in newer versions.

Convert text to image in Microsoft Word

I have a large book written in Microsoft Word and want to create a macro that will find all text using a predefined style and convert that text to an inline image. This text will be in Arabic and generally no longer than 4-5 lines. Is this possible?
UPDATE: Here's an example to show what I'm referring to:
I want to replace that entire line in Arabic with an image (as if I cropped this attached image to only include the Arabic and then replaced the line in Arabic with the image).
The reason I want a macro or script to do this is because there are hundreds of such lines and updating them one by one is cumbersome plus that will make modifications difficult later on.
UPDATE2: I found an interesting option here: http://windowssecrets.com/forums/showthread.php/31344-Convert-Text-to-an-Image-of-Text-in-VBA-(Office-2000-Sr1a)
It looks like you can cut a piece of text and then "Paste Special" as an image. So if there's a way to automate that that might work.
This is not an answer although I hope it will grow into a community answer. At the moment it is an exploration of what is required to solve the problem.
I know from the discussion when this question was posted on Super User that Abdullah wishes to publish his book on Kindle. So the question is really about how to get a document in English and Arabic ready for publication as an e-Book.
The Kindle does not support Arabic. The number of languages it does support is slowly increasing but there is no evidence I can find that Amazon has plans to add Arabic in the foreseeable future.
The format behind an Amazon e-Book is a cut-down version of HTML. If a Word document containing Arabic letters is exported to HTML, the Arabic letters are included as character entities; for example: “&#64336; &#64337; &#64338; &#64339;”. Importing the original Word or the HTML version to Kindle results in the leading bits being discarded, so these characters are displayed as P, Q, R and S instead of “ﭐ ﭑ ﭒ ﭓ” (Alef Wasla isolated form, Alef Wasla final form, Beeh isolated form and Beeh final form).
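The P, Q, R, S substitution is consistent with only the low byte of each code point surviving. The short Python check below assumes that "leading bits being discarded" means masking each code point to 8 bits; that is an inference from the observed output, not documented Kindle behaviour.

```python
# Code points 64336-64339 are U+FB50 to U+FB53
codes = [64336, 64337, 64338, 64339]

# keep only the low 8 bits of each code point
truncated = "".join(chr(code & 0xFF) for code in codes)

# the characters that should have been displayed
intended = "".join(chr(code) for code in codes)
```

Here `truncated` is the string the Kindle showed and `intended` holds the four Arabic presentation forms.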
I have tried Abdullah’s idea of saving some Arabic letters in a PNG file and creating an HTML file containing <p> … </p> <img src="Arabic.png"> <p> … </p>. The appearance of this file on my Kindle 2 is perfectly acceptable so this has the potential to be a solution. The question is: how can the necessary conversions be performed?
We need to extract each Arabic string from either the Word document or its HTML equivalent and import it into a program that can convert them to PNG files.
The only way that I know of automating this would be to copy each string to a slide within PowerPoint. With PowerPoint’s SaveAs option it is possible to save each slide as a separate PNG file. The slides are named: SLIDE1.PNG, SLIDE2.PNG, SLIDE3.PNG and so on in sequence which would allow a macro to relate the results to the original strings. It would then be possible to replace the Arabic strings in the HTML file with the image elements. None of this would be too difficult to automate but there is a problem with the slides all being the size of the PowerPoint page. The page could be made smallish but what we need is for each slide to be cropped to just bigger than that slide’s text. I cannot think of any way of automating this cropping.
Does anyone have a better approach than converting each Arabic phrase to a PNG file?
I have been looking for PNG editors with some sort of command line interface but can find nothing that would be easier than using PowerPoint. Does anyone know of an alternative to PowerPoint?
Does anyone have any suggestions for automating the cropping of each image? When a string is placed in a PowerPoint slide it is possible to set its width to, say, 6.5cm (which looks good on my Kindle) and get the height determined by PowerPoint. This could be saved for later use if anyone knows how to use it.
Implementing solution
Pending any suggestions for improving the approach described above, the following outlines how I would implement it.
I would not attempt to process the Word document. I would save it as a Web Page, Filtered HTML file, which is a required step on the way to creating a Kindle eBook, and process that.
Within the HTML file created from my test document, the Arabic phrase comes out as:
<p class="MsoNormal"></p>
<p class="MsoNormal" align="center" style="text-align:center"><span dir="RTL"
style="font-size:24.0pt;font-family:Arial">
&#64336;&#64337;&#64338;&#64339;&#64340;&#64341;
&#64342;&#64343;&#65153;&#65154;&#65276;&#65275;
&#65274;&#65273;&#65246;&#65226;&#65227;&#65228;
</span><span style="font-size:24.0pt"></span></p>
<p class="MsoNormal"></p>
<p class="MsoNormal"></p>
I assume Abdullah's document will result in something similar. Note 1: the above is a random collection of Arabic letters. Note 2: they are held left-to-right in reading sequence even though, when displayed or printed, they are read right-to-left.
The whole of this block will have to be replaced with something like:
<br><img src="xxxx.png"><br>
where the file xxxx.png holds an image of the Arabic text.
The file names, such as xxxx.png, could be systematic (A001.png, A002.png, ...) but I would have thought that transliterating the first ten or twenty characters of the phrase from the Arabic to English alphabets and using the result, with a numeric suffix, as the file name would be more convenient.
I would hold the records necessary to manage the process in an Excel worksheet. I would place the VBA code in the same workbook.
The steps in the conversion process that I envisage are:
VBA macro to extract Arabic strings from latest HTML file and add new strings to the Excel worksheet. (More about the Excel worksheet later.)
VBA macro to create PowerPoint file, with one slide per new string, and use SaveAs in PNG format to create one PNG file per slide before discarding the PowerPoint file.
Human to crop each PNG file. (There appears to be no way of automating the cropping so this task will be minimised by use of data in the Excel worksheet.)
VBA macro to rename each slide from SLIDEnnn.PNG to its permanent name and to record the permanent name in the Excel worksheet.
VBA macro to update the latest HTML file by replacing the block containing the Arabic phrase with the appropriate HTML IMG element.
The Excel worksheet needs two columns: Arabic phrase and PNG file name. If there is any risk of the worksheet being sorted between steps 2 and 4, we may need a sequence number as well.
Macro 1 will extract an Arabic phrase from the HTML file, look down the list in the worksheet for this phrase and add the phrase at the bottom if it is not already present.
Macro 2 will look for phrases in the worksheet that do not have a PNG file name. These new phrases are the ones to be written to the PowerPoint presentation. That is, a phrase only goes into this process once.
Task 3, cropping each PNG file, will be a pain. All I can say is that it will only be once per phrase.
Macro 4 will assume that the SLIDE001.PNG, SLIDE002.PNG, … are in the sequence of phrases without PNG files in the worksheet. If this might not be true (because the worksheet has been sorted) we will either need a sequence number or to retain the PowerPoint file. The macro will assign a unique name to each new phrase, record this name in the worksheet and rename the PNG file.
Macro 5 creates a new copy of the latest HTML file using the contents of the worksheet to determine which phrase to replace with which PNG file.
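The bookkeeping of Macros 1 and 5 can be sketched in a few lines. Note that this is Python rather than the VBA envisaged above, a plain dictionary stands in for the Excel worksheet, and the names `replace_arabic_blocks`, `P_BLOCK` and `ARABIC_RANGES` are hypothetical. The sketch assumes each Arabic phrase sits in its own <p> … </p> block as numeric character entities, as in the sample HTML, and uses the systematic A001.png naming.

```python
import re

P_BLOCK = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL)
ENTITY = re.compile(r"&#(\d+);")
# Arabic, Arabic Presentation Forms-A and Presentation Forms-B
ARABIC_RANGES = ((0x0600, 0x06FF), (0xFB50, 0xFDFF), (0xFE70, 0xFEFF))


def is_arabic(code):
    return any(low <= code <= high for low, high in ARABIC_RANGES)


def replace_arabic_blocks(html, names):
    """Replace each <p> block holding Arabic entities with an IMG element.

    names maps phrase -> PNG file name and is updated in place, so a
    phrase seen before keeps its file name (the worksheet's role).
    """
    def swap(match):
        codes = [int(c) for c in ENTITY.findall(match.group(0))]
        phrase = "".join(chr(c) for c in codes if is_arabic(c))
        if not phrase:
            return match.group(0)  # no Arabic here, leave block alone
        if phrase not in names:
            names[phrase] = "A%03d.png" % (len(names) + 1)
        return '<br><img src="%s"><br>' % names[phrase]

    return P_BLOCK.sub(swap, html)


html = (
    '<p class="MsoNormal">Hello</p>\n'
    '<p class="MsoNormal" align="center"><span dir="RTL">'
    "&#64336;&#64337;</span></p>"
)
names = {}
result = replace_arabic_blocks(html, names)
```

Creating the PNG files themselves (steps 2 to 4) is not covered by this sketch.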
This process is not ideal but it will achieve the desired result and has no obvious complications. Any suggestions for improving it?
Before you begin these instructions, press record in the Microsoft Word macro editor, so you can see what the VBA code is.
I wonder whether this would be easier if you converted the docx file to .rtf (rich text format) and replaced that line with an image. Go to File > Save As... and name it "old.rtf", then replace the line with an image, Save As... again as "new.rtf", and use Beyond Compare or your favorite diff program to see what changed. It should be easy to do this programmatically if you choose to. I think working in text would be easier than Microsoft's binary format unless you can find a good library to modify their doc or docx formats.
Sub CopySelPasteAsPicture()
    ' Take a picture of a selection and paste it at the
    ' document end
    With Selection
        .CopyAsPicture
    End With
    ActiveDocument.Content.Select
    With Selection
        .Collapse Direction:=wdCollapseEnd
        .TypeParagraph
        .TypeParagraph
        .PasteSpecial DataType:=wdPasteMetafilePicture
    End With
End Sub