Is there a programmatic way to find a specific word containing a strike-through in a PDF file?
For example: test
I've tried converting the PDF to txt using the following (ghostscript) command:
gswin64.exe -sDEVICE=txtwrite -o output.txt "C:\docs\Letter.pdf"
But the word 'test' just appears as normal in output.txt
Any suggestions would be greatly appreciated
I have a plain text file containing ansi escape codes for colouring text. If it matters, I generated this file using the python package colorama.
How can I convert this file to PDF with the colors properly rendered? I was hoping for a simple solution using e.g. pandoc, but any other command-line or programmatic solution would be fine as well. Example:
echo -e '\e[0;31mHello World\e[0m' > example.txt
# Convert to pdf
pandoc example.txt -o example.pdf # Gives error
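One possible route, sketched here rather than offered as a definitive answer, is to translate the escape codes into HTML first and then hand the HTML to pandoc or another HTML-to-PDF tool. The function name ansi_to_html and the colour palette below are assumptions made for illustration; only the basic foreground codes 30-37 and the reset code 0 are handled, and any other code (bold, background, bright colours) is silently dropped.

```python
import html
import re

# Assumed mapping from SGR foreground codes 30-37 to CSS colour names.
ANSI_COLOURS = {
    30: "black", 31: "red", 32: "green", 33: "olive",
    34: "blue", 35: "purple", 36: "teal", 37: "silver",
}

# Matches sequences such as ESC[0;31m or ESC[0m.
SGR_PATTERN = re.compile(r"\x1b\[([0-9;]*)m")


def ansi_to_html(text):
    """Convert basic ANSI foreground colour codes to HTML <span> elements."""
    parts = []
    open_span = False
    position = 0
    for match in SGR_PATTERN.finditer(text):
        # Emit the plain text that precedes this escape sequence.
        parts.append(html.escape(text[position:match.start()]))
        position = match.end()
        # An empty parameter list (ESC[m) is treated as a reset.
        codes = [int(c) for c in match.group(1).split(";") if c] or [0]
        for code in codes:
            if code == 0 and open_span:
                parts.append("</span>")
                open_span = False
            elif code in ANSI_COLOURS:
                if open_span:
                    parts.append("</span>")
                parts.append('<span style="color:%s">' % ANSI_COLOURS[code])
                open_span = True
    parts.append(html.escape(text[position:]))
    if open_span:
        parts.append("</span>")
    return "<pre>" + "".join(parts) + "</pre>"
```

Writing the returned string to example.html gives a file that an HTML-to-PDF converter can pick up; whether the colours survive depends on the converter and PDF engine used.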
I have some pdf files
Lettera_Contributi_201701-1.pdf
Lettera_Contributi_201701-2.pdf
Lettera_Contributi_201701-3.pdf
and so on...
and I'd like to merge only their 2nd pages into one PDF file. I've tried the following pdftk command on the whole list of files:
pdftk *.pdf cat 2 output test.pdf
but the result I get in test.pdf is just the 2nd page of the first PDF.
Any ideas?
$ pdftk *.pdf cat 2 output test.pdf verbose
Command Line Data is valid.
Input PDF Filenames & Passwords in Order
( <filename>[, <password>] )
Lettera_Contributi_201701-1.pdf
Lettera_Contributi_201701-2.pdf
Lettera_Contributi_201701-3.pdf
Lettera_Contributi_201701-4.pdf
Lettera_Contributi_201701-5.pdf
Lettera_Contributi_201701-6.pdf
The operation to be performed:
cat - Catenate given page ranges into a new PDF.
The output file will be named:
test.pdf
Output PDF encryption settings:
Output PDF will not be encrypted.
No compression or uncompression being performed on output.
Creating Output ...
Adding page 2 X0X from Lettera_Contributi_201701-1.pdf
You can do it in two steps using 'find':
1) Find all source PDFs in the current folder (and any subfolders) and run 'pdftk' on each of them:
find . -name \*pdf -exec pdftk A={} cat A2 output {}_2 \;
(The command above finds all files whose names end in "pdf" and runs the command given after -exec on each one. The braces {} are replaced with the name of each file found.)
You'll get a set of new PDFs, each containing only the second page of its source. They will be named like original_filename.pdf_2,
e.g.
file1.pdf_2
file2.pdf_2
file3.pdf_2
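2) Now you can merge all the new PDFs. The wildcard must not be escaped here, so that the shell expands it, and pdftk needs the 'cat' operation when given more than one input:
pdftk *pdf_2 cat output out.pdf
You will get out.pdf containing the second pages of all the original PDFs.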
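For readers who would rather script the same two steps without 'find', here is a minimal Python sketch. It only builds the pdftk command lines; the function name second_page_commands and the default output name are assumptions for illustration, and the file names are passed in explicitly instead of being discovered on disk.

```python
def second_page_commands(pdf_names, merged_name="out.pdf"):
    """Build the pdftk command lines for the two-step approach.

    Step 1 extracts page 2 of every input into <name>_2,
    step 2 concatenates those single-page files.
    """
    # Skip the merged file itself in case it is already in the list.
    sources = sorted(name for name in pdf_names if name != merged_name)
    extracted = [name + "_2" for name in sources]

    # Step 1: one pdftk call per source file, keeping only page 2.
    commands = [
        ["pdftk", "A=" + src, "cat", "A2", "output", dst]
        for src, dst in zip(sources, extracted)
    ]
    # Step 2: join the single-page files in sorted name order.
    commands.append(["pdftk"] + extracted + ["cat", "output", merged_name])
    return commands
```

Each returned list can be passed to subprocess.run(command, check=True). Note that the order is a plain string sort, so a file ending in -10.pdf sorts before one ending in -2.pdf.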
I have a job that creates N PDF files in a folder, and I am merging these files using pdftk. Usually each PDF file is two pages long, but sometimes one runs to 3 pages, so I need to insert a blank page after each PDF in order to tell them apart.
Is there a way to achieve this?
Currently I am using the following script to merge the PDFs.
`
@echo off
pdftk *.pdf cat output Avinash.pdf
ren Avinash.pdf Avinash.doc
del *.pdf
ren Avinash.doc Avinash.pdf
`
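One way this could be approached, as a sketch only: give pdftk a one-page blank PDF under its own handle and repeat that handle after every document in the 'cat' list. The file blank.pdf is hypothetical and has to be supplied by you (and kept outside the folder that the script empties with del). The function names are made up for this example, and the multi-letter handles assume a pdftk version that accepts them (pdftk 2.x or pdftk-java).

```python
from itertools import count, islice, product
from string import ascii_uppercase


def handle_names():
    """Yield A, B, ..., Z, AA, AB, ... for use as pdftk input handles."""
    for length in count(1):
        for letters in product(ascii_uppercase, repeat=length):
            yield "".join(letters)


def merge_with_blank_pages(pdf_names, blank_pdf="blank.pdf", output="Avinash.pdf"):
    """Build one pdftk command that puts a blank page after every input PDF."""
    names = handle_names()
    blank_handle = next(names)  # "A" is reserved for the blank page
    doc_handles = list(islice(names, len(pdf_names)))

    # Declare the inputs: the blank page first, then every document.
    command = ["pdftk", blank_handle + "=" + blank_pdf]
    command += [h + "=" + name for h, name in zip(doc_handles, pdf_names)]

    # A bare handle in the cat list means "all pages of that input".
    command.append("cat")
    for handle in doc_handles:
        command += [handle, blank_handle]
    return command + ["output", output]
```

The resulting list can be run with subprocess.run(command, check=True). Very large values of N may hit the operating system's command-line length limit, in which case merging in batches would be needed.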
I don't have much experience with OCR. Here's the command I run:
tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf
The result is a perfectly structured hidden text layout: the words are in their exact places when searching the PDF.
My question is: can I get this layout as a file (hOCR or HTML)?
(Config parameters preferred, not API.)
What I've tried:
tesseract -l eng -psm 1 image_str007_0001.jpg output hocr
and
hocr2pdf -i image_str007_001 -o output.pdf < output.hocr
In output.pdf the words are badly misplaced when searching through the text. Is command 2 (the tesseract call with hocr) not the right way to create the tesseract hOCR layout file, or does hocr2pdf not create the PDF correctly?
I would like to run a batch conversion on a folder full of PDF files. I am using Xpdf, and this is the command for a single file:
c:\Test\pdftotext -layout firstpdftoconvert.pdf firstpdfconverted.txt
Could somebody please help me do it in one go (converting only the PDF files) using a batch file? Thanks in advance!
Combining the command from your question with this answer on iterating over the files of a directory:
for /r %i in (*.pdf) do "c:\Test\pdftotext" -layout "%i"
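This will process all PDF files in the current directory and, because of /r, in its subdirectories as well; drop /r to stay in the current directory only. Since no output name is given, each text file is named after its PDF with a .txt extension.
Be sure to double the % signs (%%i) if you run this from a batch file.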
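If a script is preferred over a one-line loop, the same iteration can be sketched in Python. The function names pdftotext_command and convert_folder are made up for this example, the executable path is the one from the question, and the output name is spelled out explicitly rather than left to pdftotext's default.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, exe=r"c:\Test\pdftotext"):
    """Build the pdftotext call for one PDF, writing <name>.txt next to it."""
    pdf_path = Path(pdf_path)
    return [exe, "-layout", str(pdf_path), str(pdf_path.with_suffix(".txt"))]


def convert_folder(folder, recursive=False):
    """Run pdftotext on every PDF in folder (and subfolders if recursive)."""
    pattern = "**/*.pdf" if recursive else "*.pdf"
    for pdf_path in sorted(Path(folder).glob(pattern)):
        # check=True stops at the first file that pdftotext fails on.
        subprocess.run(pdftotext_command(pdf_path), check=True)
```

Calling convert_folder(".") mirrors the loop without /r, and convert_folder(".", recursive=True) mirrors the version with /r.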