For R Markdown (Rmd) web pages I want to generate tables containing thumbnail images in the first column (each linking to a larger image or a web site) and
descriptive text in the 2nd column. One example is the following image:
I know I can create this manually in raw HTML, but that is very fiddly and time-consuming. There must be some easier way.
On a different page, I tried a markdown / pandoc table, but that didn't work, and I reverted to manual coding of HTML
icon | title
--------------------------------------------------+--------------------------
<img src="images/books/R-Graphics.jpg" height=50> |Paul Murrell, *R Graphics*, 2nd Ed.
<img src="images/books/R-graphics-cookbook.jpg" height=50> | Winston Chang, R Graphics Cookbook
<img src="images/books/lattice.png" height=50> | Deepayan Sarkar, *lattice*
<img src="images/books/ggplot2.jpg" height=50> | Hadley Wickham, *ggplot2*
Perhaps the htmltools package would be useful here, but I can't quite see how to use it in my Rmd files for this application.
You probably forgot to escape the quotes. This works fine for me:
---
title: "The Mighty Doge"
output: html_document
---
```{r}
library(knitr)
create_thumbnail <- function(file) {
  paste0("<img src=\"", file, "\" style=\"width: 50px;\"/>")
}
df <- data.frame(Image = rep("unnamed.png", 5),
                 Description = rep("Doge", 5))
df$Image <- create_thumbnail(df$Image)
kable(df)
```
Here is an approach that uses htmltools and seems much more flexible, in that I can control the details somewhat more easily.
I'm not familiar with bootstrap <div> constructs, so I used HTML table constructs. I had to define functions for tr(), td() etc.
```{r html-setup, echo=FALSE}
library(htmltools)
# table tags
tab <- function (...)
  tags$table(...)
td <- function (...)
  tags$td(...)
tr <- function (...)
  tags$tr(...)
# an <a> tag with href as the text to be displayed
aself <- function (href, ...)
  a(href, href=href, ...)
```
Then functions to construct my table entries the way I wanted:
```{r table-functions, echo=FALSE}
# thumbnail figure with href in a table column
tabfig <- function(name, img, href, width) {
  td(
    a(class = "thumbnail", title = name, href = href,
      img(src = img, width=width)
    )
  )
}
tabtxt <- function(text, ...) {
  td(text, ...)
}
```
Finally, use them to input the entries:
## Blogs
```{r do-blogs, echo=FALSE}
width <- "160px"
tab(
  tr(
    tabfig("FlowingData", "images/blogs/flowingdata.png", "http://flowingdata.com/", width=width),
    tabtxt("Nathan Yau,", aself("http://flowingdata.com/"),
      "A large number of blog posts illustrating data visualization methods with tutorials on how to do these with R and other software.")
  ),
  tr(
    tabfig("Junk Charts", "images/blogs/junkcharts.png", "http://junkcharts.typepad.com/", width=width),
    tabtxt("Kaiser Fung,", aself("http://junkcharts.typepad.com/"),
      "Fung discusses a variety of data displays and how they can be improved.")
  ),
  tr(
    tabfig("Data Stories", "images/blogs/datastories.png", "http://datastori.es/", width=width),
    tabtxt("A podcast on data visualization with Enrico Bertini and Moritz Stefaner,",
      aself("http://datastori.es/"),
      "Interviews with over 100 graphic designers & developers.")
  )
)
```
I still need to tweak the padding, but this gives me more or less what I was after:
How can I put 2 chunks of code side by side in the output file of R Markdown or Quarto?
Code
library(dplyr)
mtcars %>% select(gear)
library(dplyr)
select(mtcars, gear)
Desired layout in the PDF or HTML file
The canonical way for something like this is to use column divs:
::::: columns
::: column
```r
library(dplyr)
mtcars %>% select(gear)
```
:::
::: column
```r
library(dplyr)
select(mtcars, gear)
```
:::
:::::
This will work with HTML, reveal.js, Beamer, and PowerPoint. The default result looks a bit ugly in HTML, as there is no space between the two blocks, but we can fix that with a tiny bit of CSS. We can put it directly into the document:
<style>
.column { padding-right: 1ex }
.column + .column { padding-left: 1ex }
</style>
Things get more complicated if we wish to do the same for PDF. We'll need to convert the divs into a table, as that's the most effective way to get elements side-by-side. But that requires some heavier tools. In the YAML header, add
output:
pdf_document:
pandoc_args:
- "--lua-filter=columns-to-table.lua"
Then save the code below into a file named columns-to-table.lua.
function Div (div)
  if div.classes:includes 'columns' then
    local columns = div.content
      :filter(function (x)
        return x.classes and x.classes[1] == 'column'
      end)
      :map(function (x)
        return x.content
      end)
    local aligns = {}
    local widths = {}
    local headers = {}
    for i, k in ipairs(columns) do
      aligns[i] = 'AlignDefault'
      widths[i] = 0.98 / #columns
    end
    return pandoc.utils.from_simple_table(
      pandoc.SimpleTable('', aligns, widths, headers, {columns})
    )
  end
end
You can get rid of the lines around the table by adding
\renewcommand\toprule[2]\relax
\renewcommand\bottomrule[2]\relax
at the beginning of your document.
---
title: "Untitled"
output: html_document
---
:::::::::::::: {.columns}
::: {.column width="50%"}
```{r warning=FALSE,message=FALSE}
library(dplyr)
mtcars %>% select(gear)
```
:::
::: {.column width="50%"}
```{r warning=FALSE,message=FALSE}
library(dplyr)
select(mtcars, gear)
```
:::
::::::::::::::
I used this SO question as a resource. This relies on pandoc to lay out the columns in R Markdown HTML output.
I am trying to scrape an HTML file structured as follows using BeautifulSoup. Basically, each unit consists of:
one <h2></h2>
one <h3></h3>
more than one <p></p>
Something like the following:
<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>
<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>
<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>
<h2>.....
I want to get text including the <p> tags between </h3> and the next <h2>. The result should be:
for row #1:
<p>text1-1</p>
<p>text1-2</p>
for row #2:
<p>text2-1</p>
<p>text2-2</p>
for row #3:
<p>text3-1</p>
Here is what I tried so far:
num_h2 = len(soup.find_all('h2'))
for i in range(0, num_h2):
    print('---------')
    print(i)
    p_string = ''
    sibling = soup.find_all('h3')[i].find_next_sibling('p').getText()
    if sibling:
        p_string += sibling
    else:
        break
    print(p_string)
The problem with this solution is that it only shows the content of the first <p> under each unit. I do not know how to find out how many <p> tags there are so that I can write a for loop. Also, is there a better way to do this than using find_next_sibling()?
Maybe css selectors can help:
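for s in soup.select('h3'):
    # find_next_siblings() is the current name for the older fetchNextSiblings()
    for ns in s.find_next_siblings():
        if ns.name == "h2":
            # the next unit starts here, so stop collecting
            break
        elif ns.name == "p":
            print(ns)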
Output:
<p>text1-1</p>
<p>text1-2</p>
<p>text2-1</p>
<p>text2-2</p>
<p>text3-1</p>
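If the goal is one row per unit rather than a flat printout, the same walk can collect the paragraphs into groups. Below is a minimal sketch that deliberately uses only the standard library's html.parser instead of BeautifulSoup, so it is an alternative technique rather than the answer's method. It assumes the flat h2/h3/p layout from the question and paragraphs that contain plain text only (nested tags inside a <p> would be dropped); the class name UnitParser is hypothetical.

```python
from html.parser import HTMLParser


class UnitParser(HTMLParser):
    """Collect the <p> elements between each <h3> and the next <h2>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of "<p>...</p>" strings per <h3>
        self._collect = False   # True between an <h3> and the next <h2>
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._collect = False      # a new unit begins; stop collecting
        elif tag == "h3":
            self.rows.append([])       # open a new row for this unit
            self._collect = True
        elif tag == "p" and self._collect:
            self._in_p = True
            self._buf = []

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.rows[-1].append("<p>" + "".join(self._buf) + "</p>")
            self._in_p = False


# Sample input copied from the question.
html = """
<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>
<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>
<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>
"""

parser = UnitParser()
parser.feed(html)
parser.close()
rows = parser.rows

for i, row in enumerate(rows, start=1):
    print(f"row #{i}:", "".join(row))
```

Each entry of rows is then the list of paragraphs for one unit, which makes it straightforward to write one table row or CSV line per unit.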
In several web pages for courses I have a Resources page that lists some recommended
books, using icons of the cover and text that contains links to related materials.
I can do this, as shown below, but the code in my .Rmd file is unnecessarily complex, making it
a chore to add new books.
How can I simplify the code in these chunks? I use several functions, defined below.
```{r do-books, echo=FALSE}
width <- "160px"
tab(class="cellpadding", width="800px",
  tr(
    tabfig("fox", "images/books/car-3e.jpg",
      "https://us.sagepub.com/en-us/nam/an-r-companion-to-applied-regression/book246125", width=width),
    tabtxt("Fox & Weisberg,", a("An R Companion to Applied Regression",
        href="https://us.sagepub.com/en-us/nam/an-r-companion-to-applied-regression/book246125"),
      ". A comprehensive introduction to linear models, regression diagnostics, etc.", br()
      # "Course notes at", aself("http://ccom.unh.edu/vislab/VisCourse/index.html")
    )
  ),
  tr(
    tabfig("Wickham", "images/books/ggplot-book-2ndEd.jpg",
      "https://www.springer.com/gp/book/9780387981413", width=width),
    tabtxt("Hadley Wickham,", a("ggplot2: Elegant Graphics for Data Analysis",
        href="https://www.springer.com/gp/book/9780387981413"),
      ". The printed version of the ggplot2 book. The 3rd edition is online at",
      a("https://ggplot2-book.org/", href= "https://ggplot2-book.org/")
    )
  )
)
```
The functions tabfig and tabtxt are defined in a sourced file:
library(htmltools)
# table tags
tab <- function (...)
  tags$table(...)
td <- function (...)
  tags$td(...)
tr <- function (...)
  tags$tr(...)
# an <a> tag with href as the text to be displayed
aself <- function (href, ...)
  a(href, href=href, ...)
# thumbnail figure with href in a table column / row
tabfig <- function(name, img, href, ...) {
  td(
    a(class = "thumbnail", title = name, href = href,
      img(src = img, ...)
    )
  )
}
tabtxt <- function(text, ...) {
  td(text, ...)
}
Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.
To extract highlighted parts, you can use PyMuPDF. Here is an example which works with this pdf file:
# Based on https://stackoverflow.com/a/62859169/562769
from typing import List, Tuple

import fitz  # install with 'pip install pymupdf'


def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = []
    for i in range(quad_count):
        # where the highlighted part is
        r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    sentence = " ".join(sentences)
    return sentence


def handle_page(page):
    wordlist = page.get_text("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    highlights = []
    annot = page.first_annot
    while annot:
        if annot.type[0] == 8:  # 8 is the highlight annotation type
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights


def main(filepath: str) -> List:
    doc = fitz.open(filepath)
    highlights = []
    for page in doc:
        highlights += handle_page(page)
    return highlights


if __name__ == "__main__":
    print(main("PDF-export-example-with-notes.pdf"))
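The core of _parse_highlight is geometric: the annotation's vertices come in groups of four (one quad per highlighted strip), each quad is reduced to its bounding rectangle, and every word whose box overlaps that rectangle is kept. The sketch below restates that logic in plain Python with no PyMuPDF dependency, purely as an illustration. The helper names quad_boxes, intersects, and words_in_highlight are hypothetical, the overlap test is a simple approximation and may not match fitz.Rect.intersects at the edges, and the coordinates are made-up sample data.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def quad_boxes(vertices: Sequence[Point]) -> List[Box]:
    """Group a flat vertex list into quads of 4 points; return each bounding box."""
    boxes = []
    for i in range(len(vertices) // 4):
        quad = vertices[i * 4 : i * 4 + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def intersects(a: Box, b: Box) -> bool:
    """True when the two boxes overlap with non-zero area (approximation)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_in_highlight(vertices: Sequence[Point], words: Sequence[tuple]) -> str:
    """Join the words whose boxes overlap any quad, one quad at a time."""
    parts = []
    for box in quad_boxes(vertices):
        parts.append(" ".join(w[4] for w in words if intersects(w[:4], box)))
    return " ".join(parts)


# Made-up word tuples in the same shape as the wordlist above:
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
words = [
    (10, 10, 40, 20, "Hello", 0, 0, 0),
    (45, 10, 80, 20, "world", 0, 0, 1),
    (10, 30, 50, 40, "again", 0, 1, 0),
]

# One quad covering the first line, a second covering the word on line two.
first_line = [(8, 9), (82, 9), (8, 21), (82, 21)]
second_line = [(8, 29), (52, 29), (8, 41), (52, 41)]

print(words_in_highlight(first_line, words))
print(words_in_highlight(first_line + second_line, words))
```

Because the match is by overlap, a highlight rectangle that is slightly too tall can also pick up words from a neighbouring line, which is worth keeping in mind when checking the extracted text.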
OK, after some searching I found a way to export highlighted text from a PDF to a text file. It's not very hard:
First, highlight your text with the tool you like to use (in my case, I highlight while reading on an iPad using the GoodReader app).
Transfer your PDF to a computer and open it using Skim (a free PDF reader that is easy to find on the web).
Under FILE, choose CONVERT NOTES and convert all the notes in your document to SKIM NOTES.
That's all: simply go to EXPORT and choose EXPORT SKIM NOTES. It will export a list of your highlighted text. Once opened, this list can be exported again to a txt file.
Not much work to do, and the result is fantastic.
I've got a strange problem connected with content rendering.
I use the following code to grab the content:
lib.otherContent = CONTENT
lib.otherContent {
  table = tt_content
  select {
    pidInList = this
    orderBy = sorting
    where = colPos=0
    languageField = sys_language_uid
  }
  renderObj = COA
  renderObj {
    10 = TEXT
    10.field = header
    10.wrap = <h2>|</h2>
    20 = TEXT
    20.field = bodytext
    20.wrap = <div class="article">|</div>
  }
}
and everything works fine, except that I'd also like to use the predefined column-content templates other than simple text (Text with image, Images only, Bullet list, etc.).
The question is: what do I have to replace renderObj = COA and the rest between the braces with so that TYPO3 displays it properly?
Thanks,
I.
The available cObjects are more or less listed in TSRef, chapter 8.
TypoScript for rendering Text w/image can be found in typo3/sysext/css_styled_content/static/v4.3/setup.txt at line 724, and in the neighborhood you'll find e.g. bullets (below) and image (above), which is referenced in textpic line 731. Variants of these are what you'll write in your renderObj.
You will find more details in the file typo3/sysext/cms/tslib/class.tslib_content.php, where e.g. text w/image is found at or around line 897 and is called IMGTEXT (do a case-sensitive search). See also around line 403 in typo3/sysext/css_styled_content/pi1/class.cssstyledcontent_pi1.php, where the newer css-based rendering takes place.