Ways to separate passages in a PDF using the gaps between them?

I have some PDFs with 2-3 passages on every page. The passages are separated by a gap of a few lines, but when reading with PyMuPDF I cannot see any machine-readable separator between passages. Is there another way, or another library, that can do this?
Code:
import fitz  # PyMuPDF

doc = fitz.open('IT_past.pdf')
page = doc.load_page(0)  # zero-based page number
text = page.get_text('text')  # plain text in reading order
print(text)
(Page screenshot and link to the full PDF not preserved.)

There is no gap as such. Since it is much easier for the moment, let's look closer at the rendering in your linked viewer:

So let's replicate what is inside the real PDF (which has no web-side HTML <p> markers):
support, product design, HR Management, knowledge process outsourcing for
pharmaceutical companies and large complex projects.
Software exports make up 20 % of India's total export revenue in 2003-04, up from 4.9 %
in 1997.This figure is expected to go up to 44% of annual exports by 2010. Though India
See, there is "no gap", just left-aligned, non-justified (ragged) text. It needs a style, such as a font name, and explicit positions to hold it in place on a page that is devoid of line feeds and true carriage returns. (Occasionally there are some backspaces or vertical/horizontal moves, but these are generally meaningless in line-printer text.) Even tabs, indents and some spacing characters are normally discarded in a PDF printout.
If you want gaps or line wraps, you need to add them.
A good alternative is to export with -layout using poppler or xpdf, here to - (the console). You can pipe that output or replace - with a path/name.txt. Many other options are available, such as -nopgbrk.
xpdf-tools-win-4.04\bin32>pdftotext -f 1 -l 1 -layout IT_past.pdf -
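If the gaps have to be added back from positions, one possible approach is to group text blocks by the vertical distance between them. The sketch below is only an illustration: the function name group_by_vertical_gap and the gap_threshold value are hypothetical, and it assumes the input is a list of (y0, y1, text) tuples, such as those that can be derived from PyMuPDF's page.get_text("blocks"), which returns tuples beginning with (x0, y0, x1, y1, text, ...).

```python
def group_by_vertical_gap(items, gap_threshold):
    """Group (y0, y1, text) items into passages.

    A new passage starts whenever the vertical distance between the
    bottom of the previous item and the top of the current one exceeds
    gap_threshold (in the same units as the coordinates).
    """
    passages = []
    prev_bottom = None
    for y0, y1, text in sorted(items, key=lambda item: item[0]):
        if prev_bottom is None or y0 - prev_bottom > gap_threshold:
            passages.append([])  # start a new passage
        passages[-1].append(text)
        prev_bottom = y1
    return passages


# Hypothetical coordinates: two lines close together, then a larger gap.
example = [(10, 20, "first line"), (22, 32, "second line"), (50, 60, "new passage")]
print(group_by_vertical_gap(example, gap_threshold=10))

# With PyMuPDF the items could be built roughly like this (not run here):
#   blocks = page.get_text("blocks")
#   items = [(b[1], b[3], b[4]) for b in blocks]
```

The threshold has to be tuned per document, since a suitable value depends on the font size and line spacing.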

Related

Arabic pdf text extraction

I'm trying to extract text from Arabic PDFs (raw text extraction, not OCR).
I tried many packages and tools, including Python packages, PDFBox, the Adobe API and many others, and all of them failed to extract the text correctly: either they read the text LTR or they decode it wrongly.
Here are two samples from different tools:
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
And yes, I can copy it and get the same rendered text.
Is there any tool that can extract Arabic text correctly?
The book link can be found here.
The text in a PDF is not the same as the text used to construct it. We can see that in your example, where the page number 7 is shown in Arabic on the surface but is coded as 7 in the plain text.
However, a greater problem is language support in fonts: in Notepad I had to accept a script font to see a similarity, and that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will at best look like
Thus, in summary, your Sample 1 is equal to, if not better than, any other simple attempt.
Later edit, from B.A.'s comment below:
I found a way around this: after extracting the text, I open the txt file and normalize its content using Python's unicodedata module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function.
Check here for more information about normalization:
https://unicode.org/reports/tr15/
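As a minimal sketch of the normalization step described above: Python's unicodedata.normalize() with the NFKC form maps Arabic presentation forms (the positional glyph variants seen in sample 2) back to the base letters. The helper name normalize_extracted_text is hypothetical. Note that normalization only changes which code points are used; it does not reorder characters, so text that was extracted in visual (reversed) order stays in that order.

```python
import unicodedata


def normalize_extracted_text(text, form="NFKC"):
    """Return text with compatibility characters folded to their base forms."""
    return unicodedata.normalize(form, text)


# U+FEAE is ARABIC LETTER REH FINAL FORM; NFKC maps it to U+0631 (REH).
# U+FEFB is the LAM WITH ALEF ligature; NFKC splits it into U+0644 U+0627.
raw = "\uFEAE\uFEFB"
clean = normalize_extracted_text(raw)
print([f"U+{ord(ch):04X}" for ch in clean])
```

To apply this to a file produced by pdftotext, read the file as UTF-8, pass its contents through the function, and write the result back.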

How can I replace a single PDF page using Imagemagick?

I have a multi-page PDF document, and I want a version of the document with one of the pages in the middle replaced with a scanned-in copy, as I needed a physical signature in the document. How can I do this with ImageMagick? I'm aware that ImageMagick may not necessarily be the best tool for the job. However, the resulting PDF does not need to be high quality or a high-fidelity copy, so it should be sufficient for my needs.
As a specific example, I have a 9 page my-file.pdf, and I want to create a copy of the PDF with the 8th page replaced with page-8.png. It looks like I should be able to achieve this goal with the convert tool, though it's not immediately obvious what the syntax would be. How can I achieve this goal?
If I merely wanted to append the new page to the end of the file, I know I can do the following:
convert my-file.pdf page-8.png output-file.pdf
However, this ends up with the original pages 1-9, then the new page 8. What I actually want is to replace the original page 8 with the new page 8. My desired output is:
[original pages 1 - 7],[new page 8],[original page 9]
A specific page or range of pages can be specified using the bracket syntax with zero-based indexing. For instance, [8] will refer to the ninth page, and [0-6] to the first seven pages. Using this, a duplicate of the PDF with the 8th page replaced can be achieved as follows:
convert my-file.pdf[0-6] page-8.png my-file.pdf[8] output-file.pdf
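Because the bracket syntax is zero-based while page numbers are usually one-based, the index arithmetic is easy to get wrong. The following sketch builds the argument list for an arbitrary page; the function name replace_page_args and its parameters are hypothetical, and the caller is assumed to know the total page count.

```python
def replace_page_args(pdf, n_pages, page_no, image, output):
    """Build convert arguments that replace one-based page_no with image."""
    if not 1 <= page_no <= n_pages:
        raise ValueError("page_no must be between 1 and n_pages")

    def pages(first, last):
        # Zero-based, inclusive page selection in bracket syntax.
        if first == last:
            return f"{pdf}[{first}]"
        return f"{pdf}[{first}-{last}]"

    args = ["convert"]
    if page_no > 1:
        args.append(pages(0, page_no - 2))  # pages before the replaced one
    args.append(image)
    if page_no < n_pages:
        args.append(pages(page_no, n_pages - 1))  # pages after it
    args.append(output)
    return args


print(" ".join(replace_page_args("my-file.pdf", 9, 8, "page-8.png", "output-file.pdf")))
```

The resulting list can be passed to subprocess.run() without a shell, which avoids having to quote the brackets.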
As indicated in the question, as well as in a comment, ImageMagick is not a great tool for this task, as it rasterizes every page of the PDF. An alternative that avoids this is to use pdftk to replace the page. While the inserted page will still be an image, the other pages will not be rasterized.
First, save the scanned-in page as a PDF using an application capable of this. Then, use the pdftk cat operation to combine the PDFs:
pdftk A=my-file.pdf B=page-8.pdf cat A1-7 B A9 output output-file.pdf

Trouble with tabulizer library in r recognizing non-alphanumeric (symbol) characters on a table in a PDF

I am using the tabulizer library in r to capture data from a table located inside a PDF on a public website
(https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf).
The example table that I am interested in is on page 23 of the PDF (p. 2-21; the document has a couple of blank pages at the beginning). The table has a non-standard format and also contains different symbols (non-alphanumeric characters) in the cells.
I want to extract most, if not all, tables from this document.
I want to end up with a table in which the symbols are replaced by codes (e.g., black circles with 999, white circles with 777, plus signs with -99, etc.).
Tabulizer does a good job for the most part, converting the dark circles into consistent alphanumeric codes and keeping the plus signs, but it runs into problems with the white circles in the REC1 column, which is odd since it does seem to recognize exotic characters in other columns.
Could anyone please help fix this? I also tried selecting the table area, but the output was worse. Below is the R code I am using.
I know I can complete this process by hand for all the tables in the document using the PDF viewer's built-in select and export tools, but I would like to automate the process.
library("tabulizer")
f2 <- "https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf"
tab <- extract_tables(f2, pages = 23, method = 'lattice')
head(tab[[1]])
df <- as.data.frame(tab)
write.csv(df, file = "test.csv")

How to search my PDF with grep?

I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (the list below is not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications, which do convert fonts to so-called 'outlines'. The reason for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
In this case, text extraction (and your method of grepping for strings) should work: even though the column named uni (which tells whether a toUnicode map is embedded in the PDF file) says no for every font, the encoding column does not contain Custom but Builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1).
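The check described above can be sketched in code. The example below parses pdffonts output and flags fonts that combine a Custom encoding with no toUnicode map, the combination named earlier as a reason for failed extraction. The function names are hypothetical, and the parsing assumes the column layout shown above and font names without spaces.

```python
SAMPLE = """\
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
"""


def parse_pdffonts(output):
    """Parse pdffonts output into a list of dicts, one per font."""
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # Skip blank lines, the header row and the dashed separator row.
        if len(tokens) < 8 or tokens[0] == "name" or set(line) <= {"-", " "}:
            continue
        fonts.append({
            "name": tokens[0],
            # The type may contain spaces ("Type 1"), so count from the right.
            "type": " ".join(tokens[1:-6]),
            "encoding": tokens[-6],
            "emb": tokens[-5],
            "sub": tokens[-4],
            "uni": tokens[-3],
        })
    return fonts


def risky_fonts(fonts):
    """Fonts with a custom encoding and no toUnicode map."""
    return [f for f in fonts if f["encoding"].lower() == "custom" and f["uni"] == "no"]


fonts = parse_pdffonts(SAMPLE)
print(len(fonts), "fonts,", len(risky_fonts(fonts)), "likely to break text extraction")
```

An empty result does not guarantee that grep will find the string; it only rules out one of the causes listed above.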
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!

SPSS tables to latex (PDF) without creating an A4 page

This may be a stupid question, but I can't figure it out.
I have made some tables in SPSS. Now I want to bring them into my LaTeX document.
What I do is right-click the table in SPSS and press Export.
Here I can choose between PDF and .doc. But the PDF file that is created has the table at the top of a full page (A4 size, with "page 1" at the bottom). I do not want this; I only want the table.
Example of how it turns out:
Example of how I want it to turn out:
If I export to Word, I can then save as PDF, but the same problem occurs.
A screenshot works, but does not give me the picture quality that I prefer.
Does anyone have any tips for me?
Thanks :)
Unfortunately, SPSS does not provide native table export to LaTeX. It does provide table export to HTML and XLS, which can be converted to TeX tables afterwards. PDF export always exports the full page (very annoying for graphics as well). You could crop the PDF if need be, but you probably don't want to insert an image of the table anyway; you want a TeX table in the same font as your document.
One thing I have done in the past to export text tables with specific markup is to use the PRINT or LIST commands to print the table to the output (or to a text file) in a form that is closer to the end goal. In this NABBLE post I have some syntax that makes pandoc-flavored, pipe-style Markdown tables; it should be pretty clear how the same approach could be used for TeX tables (actually, TeX tables should be much simpler).
Here is an example of some code using LIST to make the markup closer to TeX tables.
DATA LIST FREE / Variable (A1) Mean Median (2F4.2).
BEGIN DATA
A 3.25 2.00
B 2.56 2.50
C 9.87 10.20
END DATA.
*Using LIST to make Latex style table.
STRING Mid (A1) End (A2).
COMPUTE Mid = "&".
COMPUTE End = "\\".
LIST /VARIABLES = Variable Mid Mean Mid Median End.
And here is a screen shot of the produced output on my machine.
So here I would still have to copy and paste the text output into my TeX document and add the header row.
You can also use OMS to save designed items in a variety of formats, including XML and then use an xml-to-Latex tool such as xmltex. You could probably even generate such a conversion with XSLT from the XML.
From the Viewer, you could also retrieve the table with Python scripting and use a Python-based converter tool.
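As a rough illustration of what such a Python-based converter might do, the sketch below turns a header and a list of rows into a LaTeX tabular environment, using & between cells and \\ at the end of each row. The function name to_latex_tabular and its parameters are hypothetical; it does not use any SPSS API and does not escape LaTeX special characters, so the table values are assumed to have been retrieved already and to be plain text or numbers.

```python
def to_latex_tabular(header, rows, float_fmt="{:.2f}"):
    """Format a header and rows of cells as a LaTeX tabular environment."""

    def cell(value):
        # Floats get a fixed number of decimals; everything else is str().
        return float_fmt.format(value) if isinstance(value, float) else str(value)

    col_spec = "l" + "r" * (len(header) - 1)  # first column left, rest right
    lines = [r"\begin{tabular}{" + col_spec + "}"]
    lines.append(" & ".join(header) + r" \\")
    lines.append(r"\hline")
    for row in rows:
        lines.append(" & ".join(cell(v) for v in row) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


# Same values as the DATA LIST example above.
table = to_latex_tabular(
    ["Variable", "Mean", "Median"],
    [("A", 3.25, 2.00), ("B", 2.56, 2.50), ("C", 9.87, 10.20)],
)
print(table)
```

The printed text can be pasted into the LaTeX document or written to a .tex file and pulled in with \input, so the table is typeset in the document's own font.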