.docx file chapter extraction - indexing

I would like to extract the content of a .docx file chapter-wise.
My .docx document has a table of contents, and every chapter has some content:
1. Intro
some text about Intro, these things, those things
2. Special information
these information are really special
2.1 General information about the environment
environment should be also important
2.2 Further information
and so on and so on
So finally it would be great to receive an N×3 matrix containing the index number, the index name and the content:
i_number  i_name               content
1         Intro                some text about Intro, these things, those things
2         Special information  these information are really special
...
Thanks for your help

You could export or copy-paste your .docx into a .txt file and apply this R script:
library(stringr)
library(readr)

doc <- read_file("filename.txt")

# A heading: digits and a dot, then 4-100 characters, up to the line break
pattern_chapter <- regex("(\\d+\\.)(.{4,100}?)(?:\r\n)", dotall = TRUE)

i_name <- str_match_all(doc, pattern_chapter)[[1]][, 1]   # the headings
paragraphs <- str_split(doc, pattern_chapter)[[1]]        # split on headings
content <- paragraphs[-which(paragraphs == "")]           # drop empty splits

result <- data.frame(i_name, content)
result$i_number <- seq.int(nrow(result))
View(result)
It doesn't work if your document contains any line that begins with a number but is not a heading (e.g., footnotes or numbered lists).
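If you can work from the .docx itself rather than a .txt export, heading detection can rely on paragraph styles instead of a regex. Here is a sketch of the grouping logic in plain Python; with the python-docx package you would feed it (style name, text) pairs, and the assumption that chapter titles use Word's built-in "Heading" styles is mine, not the question's:

```python
# Sketch: group (style_name, text) paragraph pairs into chapters.
# Assumes chapter titles use Word's built-in "Heading ..." styles.
def group_chapters(paragraphs):
    """paragraphs: iterable of (style_name, text) pairs.
    Returns a list of (i_number, i_name, content) rows."""
    rows, number, name, body = [], 0, None, []
    for style, text in paragraphs:
        if style.startswith("Heading"):
            if name is not None:                     # close the previous chapter
                rows.append((number, name, "\n".join(body)))
            number += 1
            name, body = text, []
        elif name is not None:                       # body text of current chapter
            body.append(text)
    if name is not None:                             # close the last chapter
        rows.append((number, name, "\n".join(body)))
    return rows

# With python-docx (pip install python-docx) the call would be:
#   from docx import Document
#   rows = group_chapters((p.style.name, p.text)
#                         for p in Document("file.docx").paragraphs)
```

The rows can then be passed to pandas or written to CSV to get the N×3 table asked for.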

Related

ways to separate passages in pdf using gap?

I have some PDFs with 2-3 passages on every page. Every passage is separated by a line gap, but when reading with PyMuPDF I cannot see any machine-printable separator between the passages. Is there any other way, or another library that can do this?
code:
import fitz  # PyMuPDF

doc = fitz.open('IT_past.pdf')
single_doc = doc.load_page(0)  # put the page number here
text = single_doc.get_text('text')
print(text)
(The question included a page screenshot and a link to the full PDF.)
There is no gap as such. Just for the moment, as it's much easier, let's look closer at your linked viewer rendering. So let's replicate what is inside the real PDF (which has none of the web side's HTML <p> markers):
support, product design, HR Management, knowledge process outsourcing for
pharmaceutical companies and large complex projects.
Software exports make up 20 % of India's total export revenue in 2003-04, up from 4.9 %
in 1997.This figure is expected to go up to 44% of annual exports by 2010. Though India
See, there is "no gap": just left-aligned, non-justified (ragged) text that needs a style (such as a font name) and stretched-out locations added, held in a page devoid of line feeds or true carriage returns. (Occasionally there are some backspaces or vertical/horizontal moves, but they are generally meaningless in line-printer text.) Even tabs, indents, and some spatial characters are normally discarded in a PDF printout.
If you want gaps or line-wrap you need to add them.
A good alternative is to export with the -layout option using Poppler's or Xpdf's pdftotext, here to - (the console); you can pipe it, or replace - with a path/name.txt. Many other options are available, such as -nopgbrk:
xpdf-tools-win-4.04\bin32>pdftotext -f 1 -l 1 -layout IT_past.pdf -
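Back in PyMuPDF, the passage boundaries are still recoverable from geometry: page.get_text("blocks") returns one tuple per text block including its bounding box, so a large vertical jump between consecutive blocks marks a passage break. A sketch of that grouping in plain Python (the 15-point threshold is an assumption to tune per document):

```python
# Sketch: split a page into passages by vertical gaps between text blocks.
# `blocks` mimics PyMuPDF's page.get_text("blocks") tuples:
# (x0, y0, x1, y1, text, ...). The 15 pt gap threshold is an assumption.
def split_passages(blocks, gap=15.0):
    blocks = sorted(blocks, key=lambda b: b[1])   # sort by top y-coordinate
    passages, current, prev_bottom = [], [], None
    for b in blocks:
        y0, y1, text = b[1], b[3], b[4]
        if prev_bottom is not None and y0 - prev_bottom > gap:
            passages.append("\n".join(current))   # big gap => new passage
            current = []
        current.append(text.strip())
        prev_bottom = y1
    if current:
        passages.append("\n".join(current))
    return passages

# With PyMuPDF:
#   import fitz
#   page = fitz.open("IT_past.pdf").load_page(0)
#   passages = split_passages(page.get_text("blocks"))
```

This keeps everything in one library; the right threshold depends on the document's line spacing, so inspect a page or two before fixing it.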

Struggling with PDF output of bookdown

I thought it would be a good idea to write a longer report/protocol using bookdown since it's more comfortable to have one file per topic to write in instead of just one RMarkdown document with everything. Now I'm faced with the problem of sharing this document - the HTML looks best (except for wide tables being cut off) but is difficult to send via e-mail to a supervisor for example. I also can't expect anyone to be able to open the ePub format on their computer, so PDF would be the easiest choice. Now my problems:
My chapter headings are pretty long, which doesn't matter in HTML but they don't fit the page headers in the PDF document. In LaTeX I could define a short title for that, can I do that in bookdown as well?
I include figure files using knitr::include_graphics() inside of code chunks, so I generate the caption via the chunk options. For some figures I can't avoid having an underscore in the caption, but that does not work out in LaTeX. Is there a way to escape the underscore that actually works (preferably for HTML and PDF at the same time)? My LaTeX output looks like this after rendering:
\textbackslash{}begin\{figure\}
\includegraphics[width=0.6\linewidth,height=0.6\textheight]{figures/0165_HMMER} \textbackslash{}caption\{Output of HMMER for PA\_0165\}\label{fig:0165}
\textbackslash{}end\{figure\}
Edit
MWE showing that the problem is an underscore in combination with out.height (or width) in percent:
---
title: "MWE FigCap"
author: "LilithElina"
date: "19 Februar 2020"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
You can also embed plots, for example:
```{r pressure, echo=FALSE, fig.cap="This is a nice figure caption", out.height='40%'}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
```{r pressure2, echo=FALSE, fig.cap="This is a not nice figure_caption", out.height='40%'}
plot(pressure)
```
Concerning shorter headings: pandoc, which is used for the markdown-to-LaTeX conversion, does not offer a "shorter heading". You can do that yourself, though:
# Really long chapter heading
\markboth{\thechapter~short heading}{}
[...]
## Really long section heading
\markright{\thesection~short heading}
This assumes a document class with chapters and sections.
Concerning the underscore in the figure caption: For me it works for both PDF and HTML to escape the underscore:
```{r pressure2, echo=FALSE, fig.cap="This is a not nice figure\\_caption", out.height='40%'}
plot(pressure)
```

Trouble with tabulizer library in r recognizing non-alphanumeric (symbol) characters on a table in a PDF

I am using the tabulizer library in r to capture data from a table located inside a PDF on a public website
(https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf).
The example table that I am interested in is on page 23 of the PDF (p. 2-21; the document has a couple of blank pages at the beginning). The table has a non-standard format and also different symbols (non-alphanumeric characters in the cells).
I want to extract most if not all tables from this document.
I want to end up with a table that has characters with codes (i.e., black circles with 999, white circles with 777, plus signs with -99, etc).
Tabulizer does a good job for the most part of converting the dark circles into consistent alphanumeric codes and keeping the plus signs, but it runs into problems in the REC1 column with white circles, which is odd since it does seem to recognize exotic characters in other columns.
Could anyone please help fix this? I also tried selecting the table area but the output was worse. Below is the r code I am using.
I know I can complete this process by hand for all the tables in the document using PDF's built-in select and export tools but would like to automate the process.
library("tabulizer")
f2 <- "https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf"
tab <- extract_tables(f2, pages = 23, method = 'lattice')
head(tab[[1]])
df <- as.data.frame(tab)
write.csv(df, file = "test.csv")
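Once the cells are extracted, the symbol-to-code substitution is a simple character mapping, in any language. A sketch in Python for illustration (the code values 999/777/-99 come from the question; which exact Unicode code points the circles come out as in your extracted text is an assumption to verify, e.g. by printing the raw cell):

```python
# Sketch: map non-alphanumeric table symbols to numeric codes after extraction.
# The target codes come from the question; the exact Unicode characters that
# tabulizer emits for each symbol are an assumption -- inspect repr(cell) first.
SYMBOL_CODES = {
    "\u25cf": "999",   # BLACK CIRCLE
    "\u25cb": "777",   # WHITE CIRCLE
    "+": "-99",        # plus sign
}

def recode_cell(cell):
    for symbol, code in SYMBOL_CODES.items():
        cell = cell.replace(symbol, code)
    return cell.strip()

def recode_table(rows):
    return [[recode_cell(c) for c in row] for row in rows]
```

The same replacement table can be applied in R with gsub() over the matrix returned by extract_tables().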

What do the ASCII characters preceding a carriage return represent in a PDF page?

This is probably a rather basic question, but I'm having a bit of trouble figuring it out, and it might be useful for future visitors.
I want to get at the raw data inside a PDF file, and I've managed to decode a page using the Python library PyPDF2 with the following commands:
import PyPDF2

with open('My PDF.pdf', 'rb') as infile:
    mypdf = PyPDF2.PdfFileReader(infile)
    raw_data = mypdf.getPage(1).getContents().getData()
    print(raw_data)
Looking at the raw data provided, I have begun to suspect that the ASCII characters preceding carriage returns are significant: every carriage return that I've seen is preceded by one. It seems like they might be some kind of token identifier. I've already figured out that /RelativeColorimetric is associated with the sequence ri\r. I'm currently looking through the PDF 1.7 standard Adobe provides, and I know an explanation is in there somewhere, but I haven't been able to find it yet in that 756-page behemoth of a document.
The defining thing here is not that \r – it is just inserted instead of a regular space for readability – but the fact that ri is an operator.
A PDF content stream uses a stack-based postfix (reverse Polish) notation: value1 value2 ... valueN operator
The full syntax of your ri, for example, is explained in Table 57 on p.127:
intent ri (PDF 1.1) Set the colour rendering intent in the graphics state (see 8.6.5.8, "Rendering Intents").
and the idea is that this indeed appears in this order inside a content stream. (... I tried to find an appropriate example of your ri in use but cannot find one; not even any in the ISO PDF itself that you referred to.)
A random stream snippet from elsewhere:
q
/CS0 cs
1 1 1 scn
1.5 i
/GS1 gs
0 -85.0500031 -14.7640076 0 287.0200043 344.026001 cm
BX
/Sh0 sh
EX
Q
(the indentation comes courtesy of my own PDF reader) shows operands (/CS0, 1 1 1, 1.5 etc.), with the operators (cs, scn, i etc.) at the end of each line for clarity.
This is explained in 7.8.2 Content Streams:
...
A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical Conventions." It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See EXAMPLE 4 in 7.4, "Filters," for an example of a content stream.
(my emphasis)
7.2.2 Character Set specifies that inside a content stream, whitespace characters such as tab, newline, and carriage return, are just that: separators, and may occur anywhere and in any number (>= 1) between operands and operators. It mentions
NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
– to which I can add that most PDF creating software indeed attempts to delimit 'lines' consisting of an operands-operator sequence with returns.
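That operands-then-operator shape can be demonstrated with a toy tokenizer. This is a sketch for whitespace-separated numeric and name tokens only; a real content stream also contains strings, arrays, dictionaries, and inline images, which need a proper lexer:

```python
# Sketch: group a whitespace-tokenized content stream into (operands, operator)
# pairs. Only numbers and /Names are handled as operands, which is enough to
# show the postfix structure; real streams need a full PDF lexer.
def pair_operators(stream):
    pairs, operands = [], []
    for token in stream.split():
        try:
            float(token)                  # numeric operand, e.g. 1.5
            operands.append(token)
        except ValueError:
            if token.startswith("/"):     # name operand, e.g. /CS0
                operands.append(token)
            else:                         # anything else is an operator
                pairs.append((operands, token))
                operands = []
    return pairs
```

Run on a slice of the snippet above, "/CS0 cs 1 1 1 scn 1.5 i" comes out as three pairs, each operator preceded by its operands, exactly as 7.8.2 describes.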

The separator between keywords in PDF meta data

I cannot find an "official" documentation on whether the keywords and keyword phrases in the meta data of a PDF file are to be separated by a comma or by a comma with space.
The following example demonstrates the difference:
keyword,keyword phrase,another keyword phrase
keyword, keyword phrase, another keyword phrase
Any high-quality references?
The online sources I found are of low quality.
E.g., an Adobe Press web page says "keywords must be separated by commas or semicolons", but in its example we see a semicolon followed by a space before the first keyword and between each pair of neighboring keywords. We don't see keyword phrases in the example.
The keywords metadata field is a single text field - not a list. You can choose whatever is visually pleasing to you. The search engine which operates on the keyword data may have other preferences, but I would imagine that either comma or semicolon would work with most modern search engines.
Reference: PDF 32000-1:2008 on page 550 at 1. Adobe; 2. The Internet Archive
ExifTool, for example, parses for comma-separated values, but if it does not find a comma it will split on spaces:
# separate tokens in comma or whitespace delimited lists
my @values = ($val =~ /,/) ? split /,+\s*/, $val : split ' ', $val;
foreach $val (@values) {
    $et->FoundTag($tagInfo, $val);
}
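The same splitting rule is easy to replicate outside ExifTool. A sketch in Python mirroring the Perl logic above:

```python
import re

# Sketch: split a Keywords string the way ExifTool does -- on commas if any
# are present, otherwise on whitespace.
def split_keywords(val):
    if "," in val:
        return re.split(r",+\s*", val)
    return val.split()
```

Note that the comma branch preserves multi-word keyword phrases, while the whitespace fallback breaks them apart, which is exactly why a phrase-bearing Keywords field should use commas (or semicolons) rather than bare spaces.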
I don't have a "high-quality reference", but if I generate a PDF using LaTeX, I do it the following way:
adding to my main.tex the following line:
\usepackage[a-1b]{pdfx}
Then I write a file main.xmpdata and add these lines:
\Title{My Title}
\Author{My Name}
\Copyright{Copyright \copyright\ 2018 "My Name"}
\Keywords{KeywordA\sep
KeywordB\sep
KeywordC}
\Subject{My Short Discription}
After generating the PDF with pdflatex, I used a Python script based on "pdfminer.six" to extract the metadata:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

fp = open('main.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)

if 'Metadata' in doc.catalog:
    metadata = resolve1(doc.catalog['Metadata']).get_data()
    print(metadata)  # the raw XMP metadata
The part with the Keywords then looks like this:
...<rdf:Bag><rdf:li>KeywordA</rdf:li>\n <rdf:li>KeywordB...
Looking at the properties of "main.pdf" with "Adobe Acrobat Reader DC", I find the following entry in the Keywords section:
;KeywordA;KeywordB;KeywordC
CommonLook claim to be "a global leader in electronic document accessibility, providing software products and professional services enabling faster, more cost-efficient, and more reliable processes for achieving compliance with the leading PDF and document accessibility standards, including WCAG, PDF/UA and Section 508."
They provide the following advice on PDF metadata:
Pro Tip: When you’re entering Keywords into the metadata, separate
them with semicolons as opposed to commas.
although they give no further reasoning as to why this is the preferred choice.