Pandoc: Is there a way to include an appendix of links in a PDF from markdown?

I use Markdown and Pandoc extensively. I would like to generate a PDF with embedded links (as usual), but in case the document is printed, I'd also like to include a table of links at the end of the document. Is there a way to do this automatically?
Example:
Title
-----
[Python][] is cool!
...
## Links ##
[Python]: http://python.org
[Pip]: https://pip.readthedocs.org
where I would actually get an extra page in my PDF with something like
Python: http://python.org
Pip: https://pip.readthedocs.org
Thanks!

This is easy to achieve with a pandoc filter.
Here is linkTable.hs, a filter that adds a table of links to the end of your document:
import Text.Pandoc.JSON
import Text.Pandoc.Walk
import Data.Monoid
main :: IO ()
main = toJSONFilter appendLinkTable
appendLinkTable :: Pandoc -> Pandoc
appendLinkTable (Pandoc m bs) = Pandoc m (bs ++ linkTable bs)
linkTable :: [Block] -> [Block]
linkTable p = [Header 2 ("linkTable", [], []) [Str "Links"], Para links]
  where
    links = concatMap makeRow $ query getLink p
    getLink (Link txt (url, _)) = [(url, txt)]
    getLink _ = []
    makeRow (url, txt) = txt ++ [Str ":", Space, Link [Str url] (url, ""), LineBreak]
Compile the filter with ghc linkTable.hs. The output is as follows.
> ghc linkTable.hs
[1 of 1] Compiling Main ( linkTable.hs, linkTable.o )
Linking linkTable ...
> cat example.md
Title
-----
[Python][] is cool!
[Pip] is a package manager.
...
[Python]: http://python.org
[Pip]: https://pip.readthedocs.org
Then run pandoc with the filter:
> pandoc -t markdown --filter=./linkTable example.md
Title
-----
[Python](http://python.org) is cool!
[Pip](https://pip.readthedocs.org) is a package manager.
...
Links {#linkTable}
-----
Python: <http://python.org>\
Pip: <https://pip.readthedocs.org>\
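
For readers who would rather not compile Haskell, the same idea can be sketched in Python against pandoc's JSON AST. This is a minimal sketch, not the filter from the answer. It assumes the JSON layout of recent pandoc versions, where a Link node looks like `{"t": "Link", "c": [attr, inlines, [url, title]]}`. The helper names `walk`, `stringify`, `collect_links` and `append_link_table` are made up for illustration.

```python
def walk(node):
    """Yield every AST element (a dict with a "t" key) below `node`, in document order."""
    if isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def stringify(inlines):
    """Flatten inline nodes to plain text, keeping only words and spaces."""
    out = []
    for el in walk(inlines):
        if el["t"] == "Str":
            out.append(el["c"])
        elif el["t"] in ("Space", "SoftBreak"):
            out.append(" ")
    return "".join(out)


def collect_links(blocks):
    """Return (text, url) pairs for every Link found in `blocks`."""
    links = []
    for el in walk(blocks):
        if el["t"] == "Link":
            _attr, inlines, (url, _title) = el["c"]  # assumed modern layout
            links.append((stringify(inlines), url))
    return links


def append_link_table(doc):
    """Append a "Links" header and one paragraph listing all links."""
    rows = []
    for text, url in collect_links(doc["blocks"]):
        rows += [
            {"t": "Str", "c": text + ":"},
            {"t": "Space"},
            {"t": "Link", "c": [["", [], []], [{"t": "Str", "c": url}], [url, ""]]},
            {"t": "LineBreak"},
        ]
    header = {"t": "Header", "c": [2, ["linkTable", [], []], [{"t": "Str", "c": "Links"}]]}
    doc["blocks"] += [header, {"t": "Para", "c": rows}]
    return doc
```

To use this as an actual filter, a script would read the JSON document from stdin, pass it through `append_link_table`, and write the result to stdout, since that is how pandoc talks to programs given with `--filter`.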


How to make a table of icons and text in Rmarkdown

In several web pages for courses I have a Resources page that lists some recommended
books, using icons of the cover and text that contains links to related materials.
I can do this, as shown below, but the code in my .Rmd file is unnecessarily complex, making it
a chore to add new books.
How can I simplify the code in these chunks? I use several helper functions, defined below.
```{r do-books, echo=FALSE}
width <- "160px"
tab(class="cellpadding", width="800px",
  tr(
    tabfig("fox", "images/books/car-3e.jpg",
           "https://us.sagepub.com/en-us/nam/an-r-companion-to-applied-regression/book246125", width=width),
    tabtxt("Fox & Weisberg,", a("An R Companion to Applied Regression",
             href="https://us.sagepub.com/en-us/nam/an-r-companion-to-applied-regression/book246125"),
           ". A comprehensive introduction to linear models, regression diagnostics, etc.", br()
           # "Course notes at", aself("http://ccom.unh.edu/vislab/VisCourse/index.html")
    )
  ),
  tr(
    tabfig("Wickham", "images/books/ggplot-book-2ndEd.jpg",
           "https://www.springer.com/gp/book/9780387981413", width=width),
    tabtxt("Hadley Wickham,", a("ggplot2: Elegant Graphics for Data Analysis",
             href="https://www.springer.com/gp/book/9780387981413"),
           ". The printed version of the ggplot2 book. The 3rd edition is online at",
           a("https://ggplot2-book.org/", href= "https://ggplot2-book.org/")
    )
  )
)
```
The functions tabfig and tabtxt are defined in a sourced file:
library(htmltools)
# table tags
tab <- function (...)
  tags$table(...)
td <- function (...)
  tags$td(...)
tr <- function (...)
  tags$tr(...)
# an <a> tag with href as the text to be displayed
aself <- function (href, ...)
  a(href, href=href, ...)
# thumbnail figure with href in a table column / row
tabfig <- function(name, img, href, ...) {
  td(
    a(class = "thumbnail", title = name, href = href,
      img(src = img, ...)
    )
  )
}
tabtxt <- function(text, ...) {
  td(text, ...)
}
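
One way to cut the repetition is to keep each book as a plain record and generate the rows in a loop, so adding a book means adding one record. The sketch below shows that shape in Python rather than R, purely as an illustration of the data-driven layout. The field names (`authors`, `title`, `cover`, `url`, `blurb`) and the functions `book_row` and `book_table` are assumptions made for this example, not part of htmltools.

```python
from html import escape


def book_row(authors, title, cover, url, blurb, width="160px"):
    """Return one <tr>: a linked thumbnail cell followed by a text cell."""
    u = escape(url)
    return (
        "<tr>"
        f'<td><a class="thumbnail" title="{escape(title)}" href="{u}">'
        f'<img src="{escape(cover)}" width="{escape(width)}"/></a></td>'
        f'<td>{escape(authors)}, <a href="{u}">{escape(title)}</a>. {escape(blurb)}</td>'
        "</tr>"
    )


def book_table(books, width="160px"):
    """Build the whole table from a list of book records."""
    rows = "\n".join(book_row(width=width, **book) for book in books)
    return f"<table>\n{rows}\n</table>"


# The two entries from the chunk above, written as data instead of nested calls.
books = [
    {
        "authors": "Fox & Weisberg",
        "title": "An R Companion to Applied Regression",
        "cover": "images/books/car-3e.jpg",
        "url": "https://us.sagepub.com/en-us/nam/an-r-companion-to-applied-regression/book246125",
        "blurb": "A comprehensive introduction to linear models, regression diagnostics, etc.",
    },
    {
        "authors": "Hadley Wickham",
        "title": "ggplot2: Elegant Graphics for Data Analysis",
        "cover": "images/books/ggplot-book-2ndEd.jpg",
        "url": "https://www.springer.com/gp/book/9780387981413",
        "blurb": "The printed version of the ggplot2 book.",
    },
]

table_html = book_table(books)
```

The same arrangement should carry over to R by storing the records in a data frame or list and mapping a row-building function over it.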

Translation with the Google Translate API

I'm trying to write a program that takes the text of a file, for example PDF, and translates the text extracted with the Google API, except that the API doesn't work with my code. I don't have a clue why it isn't working.
I've already tried to modify my code but nothing I've done works.
from tika import parser
# from googletrans import Translator
import os
from textblob import TextBlob
#os.remove("arifureta.txt")
#os.remove("arifureta-formater.txt")
#os.remove("arifureta-traduit.txt")
raw = parser.from_file('/home/tom/Téléchargements/Arifureta_ From Commonplace to World_s Strongest Vol. 1.pdf')
text = raw['content']
text = text.replace('https://mp4directs.com','')
text = text.replace('\t','')
text = text.replace('\r','')
fichier = open("arifureta.txt", "a")
fichier.write(text)
fichier.close()
fic = open("arifureta-formater.txt", "a")
cpt=0
with open("arifureta.txt") as f :
    for line in f :
        if len(line)==1 :
            cpt+=1
        else :
            cpt=0
        if cpt<2:
            fic.write(line)
fic.close()
nbLigneTraité = 0
fic2 = open("arifureta-traduit.txt", "a")
compteur=0
textPasTraduit=''
with open("arifureta-formater.txt") as f :
    for line in f :
        fic2.write(str(blob.translate(from_lang='en',to='fr')))
        if len(line)>1:
            textPasTraduit += line
            compteur+=1
            if compteur%1000==0:
                blob = TextBlob(textPasTraduit)
                try:
                    fic2.write(str(blob.translate(from_lang='en',to='fr')))
                    print(blob.translate(from_lang='en',to='fr'))
                except Exception as e:
                    pass
                nbLigneTraité+=1
                print(nbLigneTraité)
        if len(line)==1:
            fic2.write('\n')
fic2.close()
I expect the result file to contain the full translation of the PDF's text, but what I actually get is 'broken link'. I think it is due to the quantity of text, but I haven't found a way to try any other method.
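
If the failure really is caused by sending too much text in one request, one option is to split the text into bounded pieces on paragraph boundaries and translate each piece separately. The following is a minimal sketch under that assumption. The 4000-character limit is an arbitrary value chosen for illustration, not a documented quota, and `translate` is a hypothetical placeholder for whatever call performs the translation.

```python
from typing import Callable, Iterator, List


def chunk_paragraphs(text: str, max_chars: int = 4000) -> Iterator[str]:
    """Yield pieces of `text` of at most `max_chars` characters, split on blank lines.

    A single paragraph longer than `max_chars` is yielded as-is, not split further.
    """
    current: List[str] = []
    size = 0
    for para in text.split("\n\n"):
        extra = len(para) + (2 if current else 0)  # 2 for the "\n\n" separator
        if current and size + extra > max_chars:
            yield "\n\n".join(current)
            current, size = [], 0
            extra = len(para)
        current.append(para)
        size += extra
    if current:
        yield "\n\n".join(current)


def translate_in_chunks(text: str,
                        translate: Callable[[str], str],
                        max_chars: int = 4000) -> str:
    """Translate `text` piece by piece and join the results again."""
    return "\n\n".join(translate(piece) for piece in chunk_paragraphs(text, max_chars))
```

Here `translate` could wrap the TextBlob call from the question, for example a function that builds a `TextBlob` from its argument and returns the translated string. Catching and logging exceptions per piece, instead of silently passing, would also show which piece fails.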

R markdown: simplify creating tables of figures and text

For R markdown Rmd web pages I want to generate tables containing in the first column thumbnail images (that link to a larger image or a web site) and
descriptive text in the 2nd column. One example is the following image:
I know I can create this manually in raw HTML, but that is very fiddly and time-consuming. There must be some easier way.
On a different page, I tried a markdown / pandoc table, but that didn't work, and I reverted to coding the HTML by hand:
icon | title
--------------------------------------------------+--------------------------
<img src="images/books/R-Graphics.jpg" height=50> |Paul Murrell, *R Graphics*, 2nd Ed.
<img src="images/books/R-graphics-cookbook.jpg" height=50> | Winston Chang, R Graphics Cookbook
<img src="images/books/lattice.png" height=50> | Deepayan Sarkar, *lattice*
<img src="images/books/ggplot2.jpg" height=50> | Hadley Wickham, *ggplot2*
Perhaps the htmltools package would be useful here, but I can't quite see how to use it in my Rmd files for this application.
You probably forgot to escape the quotes. This works fine for me:
---
title: "The Mighty Doge"
output: html_document
---
```{r}
library(knitr)
create_thumbnail <- function(file) {
  paste0("<img src=\"", file, "\" style=\"width: 50px;\"/>")
}
df <- data.frame(Image = rep("unnamed.png", 5),
                 Description = rep("Doge", 5))
df$Image <- create_thumbnail(df$Image)
kable(df)
```
Here is an approach that uses htmltools and seems much more flexible, in that I can control the details somewhat more easily.
I'm not familiar with bootstrap <div> constructs, so I used HTML table constructs. I had to define functions for tr(), td() etc.
```{r html-setup, echo=FALSE}
library(htmltools)
# table tags
tab <- function (...)
  tags$table(...)
td <- function (...)
  tags$td(...)
tr <- function (...)
  tags$tr(...)
# an <a> tag with href as the text to be displayed
aself <- function (href, ...)
  a(href, href=href, ...)
```
Then functions to construct my table entries the way I wanted:
```{r table-functions, echo=FALSE}
# thumbnail figure with href in a table column
tabfig <- function(name, img, href, width) {
  td(
    a(class = "thumbnail", title = name, href = href,
      img(src = img, width=width)
    )
  )
}
tabtxt <- function(text, ...) {
  td(text, ...)
}
```
Finally, use them to input the entries:
## Blogs
```{r do-blogs, echo=FALSE}
width="160px"
tab(
  tr(
    tabfig("FlowingData", "images/blogs/flowingdata.png", "http://flowingdata.com/", width=width),
    tabtxt("Nathan Yau,", aself("flowingdata.com/"),
           "A large number of blog posts illustrating data visualization methods with tutorials on how to do these with R and other software.")
  ),
  tr(
    tabfig("Junk Charts", "images/blogs/junkcharts.png", "http://junkcharts.typepad.com/", width=width),
    tabtxt("Kaiser Fung,", aself("http://junkcharts.typepad.com/"),
           "Fung discusses a variety of data displays and how they can be improved.")
  ),
  tr(
    tabfig("Data Stories", "images/blogs/datastories.png", "http://datastori.es/", width=width),
    tabtxt("A podcast on data visualization with Enrico Bertini and Moritz Stefaner,",
           aself("http://datastori.es/"),
           "Interviews with over 100 graphic designers & developers.")
  )
)
```
I still need to tweak the padding, but this gives me more or less what I was after:

cannot import Control.Proxy.Trans.Either

I'm trying to learn how to use pipes together with attoparsec by following the tutorial https://hackage.haskell.org/package/pipes-attoparsec-0.1.0.1/docs/Control-Proxy-Attoparsec-Tutorial.html . But I was not able to import Control.Proxy.Trans.Either . In which lib is this module located?
You hit on an old version of pipes-attoparsec corresponding to an old version of pipes. With recent versions, something like the first example would be written without a pipe. We would use the parsed function, which just applies a parser repeatedly until it fails, streaming good parses as they come.
{-# LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Attoparsec
import Data.Attoparsec.Text
import Data.Text (Text)
data Name = Name Text deriving (Show)
hello :: Parser Name
hello = fmap Name $ "Hello " *> takeWhile1 (/='.') <* "."
helloparses :: Monad m => Producer Text m r -> Producer Name m (Either (ParsingError, Producer Text m r) r)
helloparses = parsed hello
process txt = do
  e <- runEffect $ helloparses txt >-> P.print
  case e of
    Left (err,rest) -> print err >> runEffect (rest >-> P.print)
    Right () -> return ()
input1, input2 :: Monad m => Producer Text m ()
input1 = each
  [ "Hello Kate."
  , "Hello Mary.Hello Jef"
  , "f."
  , "Hel"
  , "lo Tom."
  ]
input2 = input1 >> yield "garbage"
Then we see
-- >>> process input1
-- Name "Kate"
-- Name "Mary"
-- Name "Jeff"
-- Name "Tom"
-- >>> process input2
-- Name "Kate"
-- Name "Mary"
-- Name "Jeff"
-- Name "Tom"
-- ParsingError {peContexts = [], peMessage = "string"}
-- "garbage"
The other principal function pipes-attoparsec defines is parse. It converts an attoparsec parser into a pipes StateT parser that parses an initial segment of a producer matching the parser. You can read about such parsers here: http://www.haskellforall.com/2014/02/pipes-parse-30-lens-based-parsing.html
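
To make the behaviour of parsed concrete for readers less familiar with pipes, here is a rough Python analogue of the hello example. It is only a sketch of the idea (apply one parser repeatedly over chunked input, collect good parses, stop at the first failure and hand back the unconsumed input), not how pipes-attoparsec is implemented. The names `parse_hellos` and `could_still_match` are invented for this illustration, and unlike the Haskell version it collects results in a list instead of streaming them.

```python
import re
from typing import Iterable, List, Tuple

HELLO = re.compile(r"Hello ([^.]+)\.")  # "Hello ", one or more non-dots, then "."
PREFIX = "Hello "


def could_still_match(buf: str) -> bool:
    """True if appending more input to `buf` might still complete a parse."""
    if len(buf) < len(PREFIX):
        return PREFIX.startswith(buf)
    return buf.startswith(PREFIX) and "." not in buf


def parse_hellos(chunks: Iterable[str]) -> Tuple[List[str], str]:
    """Parse names from chunked text.

    Returns (names, leftover). An empty leftover means the input was consumed
    cleanly; otherwise leftover is the input from the point of failure onward.
    """
    names: List[str] = []
    buf = ""
    it = iter(chunks)
    for chunk in it:
        buf += chunk
        while buf:
            m = HELLO.match(buf)
            if m:
                names.append(m.group(1))
                buf = buf[m.end():]
            elif could_still_match(buf):
                break  # incomplete parse: wait for the next chunk
            else:
                return names, buf + "".join(it)  # failure: return the rest
    return names, buf


input1 = ["Hello Kate.", "Hello Mary.Hello Jef", "f.", "Hel", "lo Tom."]
input2 = input1 + ["garbage"]
```

Calling `parse_hellos(input1)` gives the four names with an empty leftover, while `parse_hellos(input2)` gives the same names with `"garbage"` left over, mirroring the two runs shown above.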

How to extract Highlighted Parts from PDF files

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.
To extract highlighted parts, you can use PyMuPDF. Here is an example which works with this pdf file:
Direct download
# Based on https://stackoverflow.com/a/62859169/562769
from typing import List, Tuple
import fitz # install with 'pip install pymupdf'
def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = []
    for i in range(quad_count):
        # where the highlighted part is
        r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    sentence = " ".join(sentences)
    return sentence
def handle_page(page):
    wordlist = page.get_text("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    highlights = []
    annot = page.first_annot
    while annot:
        if annot.type[0] == 8:
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights
def main(filepath: str) -> List:
    doc = fitz.open(filepath)
    highlights = []
    for page in doc:
        highlights += handle_page(page)
    return highlights
if __name__ == "__main__":
    print(main("PDF-export-example-with-notes.pdf"))
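
The core of `_parse_highlight` is a geometric test: a word counts as highlighted if its bounding box overlaps the highlight's rectangle. Here is a library-free sketch of that test with made-up coordinates. `Box`, `intersects` and `words_in` are illustrative names, not PyMuPDF API, and this sketch treats boxes that only touch at an edge as not overlapping, which may differ from what the library does.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def intersects(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share a region of non-zero area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def words_in(highlight: Box, words: List[Tuple[Box, str]]) -> str:
    """Join the words whose boxes overlap `highlight`, in reading order."""
    hits = [(box, text) for box, text in words if intersects(box, highlight)]
    hits.sort(key=lambda h: (h[0].y1, h[0].x0))  # ascending y, then x
    return " ".join(text for _, text in hits)


# Hypothetical page: two words on the first line, one on the second.
page_words = [
    (Box(12, 0, 22, 10), "beta"),
    (Box(0, 0, 10, 10), "alpha"),
    (Box(0, 20, 10, 30), "gamma"),
]
highlighted = words_in(Box(5, 2, 15, 8), page_words)
```

Because the test is a plain overlap check, a highlight that slightly overshoots into a neighbouring line or word will pick that word up too, so the extracted text may need a tolerance or manual cleanup.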
OK, after some searching I found a way to export highlighted text from a PDF to a text file. It's not very hard:
First, highlight your text with whatever tool you like (in my case, I highlight while reading on an iPad with the Goodreader app).
Transfer the PDF to a computer and open it in Skim (a free PDF reader that is easy to find on the web).
In the FILE menu, choose CONVERT NOTES to convert all the notes in your document to SKIM NOTES.
That's all: go to EXPORT and choose EXPORT SKIM NOTES. This exports a list of your highlighted text. Once opened, the list can be exported again as a txt file.
Not much work to do, and the result is fantastic.