Details about "Number of Input pdf" or "Input PDF Character Len" - pdf

I'm using pdftk within my application to generate a composite PDF from several individual PDF files. Sometimes, for whatever reason, pdftk does not run.
I suspect the reason is the number of input files (586 in my case) or the total character length of the input PDF file names (35624 in my case).
What are the limits for input files in pdftk?
Thank you for any answer.
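For what it's worth, pdftk itself doesn't document a fixed limit on the number of input files; with 586 files and a 35624-character command, the usual suspect is the operating system's command-line length limit (on Windows, for instance, the whole command line is capped at 32767 characters, which 35624 exceeds). A minimal Python sketch of the common workaround, merging in batches; the file names and batch size are placeholders:

import subprocess

def pdftk_cat(inputs, output):
    # pdftk in1.pdf in2.pdf ... cat output merged.pdf
    subprocess.run(["pdftk", *inputs, "cat", "output", output], check=True)

def merge_in_batches(inputs, output, batch_size=50):
    # Merge a batch at a time so no single command line grows too long,
    # then merge the intermediate parts into the final composite.
    parts = []
    for i in range(0, len(inputs), batch_size):
        part = f"part_{i // batch_size}.pdf"  # placeholder temp names
        pdftk_cat(inputs[i:i + batch_size], part)
        parts.append(part)
    pdftk_cat(parts, output)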

Related

Arabic PDF text extraction

I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR).
I tried many packages and tools (Python packages, PDFBox, the Adobe API, and many others) and none of them worked: they all failed to extract the text correctly, either reading the text left-to-right or decoding it incorrectly.
Here are two samples from different tools.
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
And yes, I can copy it and get the same rendered text.
Is there any tool that can extract Arabic text correctly?
The book link can be found here.
The text in a PDF is not the same as the text used for its construction; we can see that in your example, where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.
However, a greater problem is language support in fonts: in Notepad I had to accept a script font to see a similarity, but that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will at best look like your samples above.
Thus, in summary, your sample 1 is equal to, if not better than, any other simple attempt.
Later edit, from B.A.'s comment below:
I found a way to work around this: after extracting the text, I open the txt file and normalize its content using Python's unicodedata module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function.
Check here for more info about normalization:
https://unicode.org/reports/tr15/
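To make that workflow concrete, here is a minimal Python sketch of the approach described above: run pdftotext, then NFKC-normalize the extracted text. The file names are placeholders.

import subprocess
import unicodedata

# Extract one page range with pdftotext (placeholder file names).
subprocess.run(["pdftotext", "-f", "5", "-l", "5", "book.pdf", "raw.txt"], check=True)

with open("raw.txt", encoding="utf-8") as f:
    raw = f.read()

# NFKC maps Arabic presentation forms (the ligature glyphs seen in sample 2)
# back to their base code points.
normalized = unicodedata.normalize("NFKC", raw)

with open("normalized.txt", "w", encoding="utf-8") as f:
    f.write(normalized)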

Wrapping page-width-exceeding tables in Pandoc (docx->pdf)

I'm trying to convert a rather long docx file to a pdf via pandoc, and was wondering if there's any way that I can get pandoc to automatically use multi-line tables in the output pdf. I know that there's a pandoc multiline table markdown extension, so would there be any way to convert a docx file into a multiline-markdown-compliant format, or just convert it like this to a pdf directly?
On Windows you would normally print the DOCX to a PDF printer.
That is the easiest and most precise way to convert from DOCX to PDF.
I wanted to have RST as the main format, so I converted some DOCX files to reStructuredText (RST) with pandoc.
Table entries were sometimes made multi-line where they had not been before, breaking words.
I converted them to list-table with this script: listtable.py
(A 0 in the join parameter string joins the lines in that position/column without a space.)
From RST to HTML, PDF, DOCX I use this script:
dcx.py
In the Makefile you will see the pandoc command to convert to pdf.
I doubt a direct conversion using pandoc would bring good results.
Use an intermediate (light) markup text format and edit that until the conversion to pdf fits your expectations.
This only makes sense if you want a pure text version anyway.
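For the route the question asks about, a minimal sketch of driving pandoc from Python: convert the DOCX to pandoc markdown with the multiline_tables extension enabled, edit the intermediate file as needed, then convert it to PDF. The file names are placeholders, and pandoc needs a LaTeX engine installed for the PDF step.

import subprocess

# DOCX -> pandoc markdown with multiline tables enabled, for manual editing.
subprocess.run(["pandoc", "input.docx", "-t", "markdown+multiline_tables",
                "-o", "intermediate.md"], check=True)

# Edited markdown -> PDF.
subprocess.run(["pandoc", "intermediate.md", "-o", "output.pdf"], check=True)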

Extract sections of PDF

I am trying to extract sections of a PDF file, for use in text analysis. I tried using pdf-extract to accomplish this. However, a command such as
pdf-extract extract --regions --no-lines Bauer2010.pdf
only extracts the (x,y) coordinates of a region, as in the example below.
<region x="226.32" y="750.47" width="165.57" height="6.37"
line_height="6.37" font="BGBFHO+AdvP4DF60E">Patient Education and
Counseling 79 (2010) 315-319</region>
Can sections of a PDF be extracted?
Have a look at http://text-analyzer.com, where you can upload your PDF file and it will convert it into a format suitable for natural language processing. Once converted into a text file, it can process the file, breaking it down into sentences with sentiment analysis. It has over 40 different types of sentence views, and you can tag sections; those tagged sentences can be exported.
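Note also that pdf-extract's region output already contains the text content alongside the coordinates, so you can recover the text in reading order with a little post-processing. A minimal Python sketch, assuming the <region> elements have been wrapped in a single root element in a file regions.xml so that it parses as well-formed XML:

import xml.etree.ElementTree as ET

root = ET.parse("regions.xml").getroot()

# PDF y coordinates grow upward, so sort regions from top of page to bottom.
for region in sorted(root.iter("region"), key=lambda r: -float(r.get("y"))):
    print(region.text)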

Are byte order marks allowed in PDF document?

I'm having an issue with a filter program I wrote. It detects whether a file is a PDF document by reading the first 5 bytes of the file and comparing them to a fixed buffer:
25 50 44 46 2D
This works fine, except that I'm seeing a few files that start with a byte order mark instead:
EF BB BF 25 50 44 46 2D
^-------^
I'm wondering if that is actually allowed by the PDF specs. If I check section 7.5 of that documentation, I read it as "no":
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7
Yet I see these documents in the wild, and users get confused because PDF reader programs can open these documents but my filter rejects them.
So: are BOM markers allowed at the start of PDF documents? (I'm NOT talking about string objects here but the PDF file itself.)
So: are BOM markers allowed at the start of PDF documents?
No, just like you read in the specification, nothing is allowed before the "%PDF" bytes.
But Adobe Reader has a long history of accepting files in spite of some leading or trailing trash bytes.
Cf. the implementation notes in Appendix H of Adobe's pdf_reference_1-7:
3.4.1, “File Header”
Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m
...
3.4.4, “File Trailer”
Acrobat viewers require only that the %%EOF marker appear somewhere
within the last 1024 bytes of the file.
And since people have a tendency to think that a PDF which Adobe Reader displays as desired is valid, there are many PDFs in the wild that do have trash bytes up front.
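If you want a filter that matches that lenient behaviour, here is a minimal sketch (not the poster's actual program) in Python that accepts the header anywhere in the first 1024 bytes, which also covers a leading UTF-8 BOM:

def looks_like_pdf(path: str) -> bool:
    # Mirror the Acrobat behaviour quoted above: accept "%PDF-" anywhere
    # in the first 1024 bytes, which also tolerates a leading EF BB BF BOM.
    with open(path, "rb") as f:
        return b"%PDF-" in f.read(1024)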
No, a BOM is not valid at the front of a PDF file.
A PDF is a binary file format, so a BOM wouldn't actually make sense; it would be like having a BOM at the front of a ZIP file or a JPEG.
I'm guessing the PDFs you are consuming come from misconfigured applications that either have something at the front of their output buffer already or, more likely, were created with the incorrect assumption that a PDF is a text-based format.

Output filenames when extracting a range of pages from pdf into jpeg using Imagemagick

I am trying to extract a range of pages from a multipage PDF file into individual jpegs using convert (ImageMagick). The extraction works fine. What I am stuck on is that if I want to extract page range 10-20, I still get jpeg files named page-0.jpeg to page-9.jpeg, while I want them to be named page-10.jpeg to page-20.jpeg. Is there a way of specifying that on the command line?
I require this because I want to extract pages in chunks of 10 to avoid eating up too much memory for huge PDF files, and I don't want to keep renaming the files.
I remember having this working in an earlier project but can't figure out what I am missing now.
Finally managed to do this. Leaving an answer in case somebody else is looking for the same. The solution works with ImageMagick 6.5.1.
So we want to extract pages numbered i to j from a.pdf into individual jpegs named a-i.jpeg to a-j.jpeg (e.g. a-10.jpeg to a-20.jpeg):
convert a.pdf[i-j] -set filename:page "%[fx:t+i]" a-%[filename:page].jpeg
This uses fx operators: fx:t gives the index (scene number) of the current image in the sequence, and we can add our offset i to it.
You can specify the first "page" number used by %d in the output filename by adding the -scene n parameter, e.g.:
convert a.pdf[0-9] -scene 10 a-%d.jpeg
will output a-10.jpeg, a-11.jpeg, etc.
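Combining this with the chunked extraction the question mentions, a minimal Python sketch; the page count and file names are placeholders (the real count could come from a tool such as pdfinfo):

import subprocess

total_pages = 100  # placeholder: query the real page count, e.g. with pdfinfo
chunk = 10

# Convert ten pages at a time to limit memory use; -scene keeps the output
# numbering aligned with the absolute page index.
for start in range(0, total_pages, chunk):
    end = min(start + chunk, total_pages) - 1
    subprocess.run(
        ["convert", f"a.pdf[{start}-{end}]", "-scene", str(start), "a-%d.jpeg"],
        check=True,
    )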