I have a multi-page PDF document, and I want a version of the document with one of the pages in the middle with a scanned-in copy, as I needed a physical signature in the document. How can I do this with ImageMagick? I'm aware that ImageMagick may not necessarily be the best tool for the job. However, the resulting PDF does not need to be high quality or a high fidelity copy, so it should be sufficient for my needs.
As a specific example, I have a 9 page my-file.pdf, and I want to create a copy of the PDF with the 8th page replaced with page-8.png. It looks like I should be able to achieve this goal with the convert tool, though it's not immediately obvious what the syntax would be. How can I achieve this goal?
If I merely wanted to append the new page to the end of the file, I know I can do the following:
convert my-file.pdf page-8.png output-file.pdf
However, this end up with the original pages 1-9, then the new page 8. What I actually want is to replace the original page 8 with the new page 8. My desired output is:
[original pages 1 - 7],[new page 8],[original page 9]
A specific page or range of pages can be specified using the bracket syntax with zero-based indexing. For instance, [8] will refer to the ninth page, and [0-6] to the first seven pages. Using this, a duplicate of the PDF with the 8th page replaced can be achieved as follows:
convert my-file.pdf[0-6] page-8.png my-file.pdf[8] output-file.pdf
As indicated in the question, as well as a comment, Imagemagick is not a great tool for this task, as it rasterizes the output of the PDF. An alternate solution that avoids this would be to use pdftk to replace the page. While the inserted page will still be an image, the replaced pages will not be rasterized.
First, save the scanned-in page as a PDF using an application capable of this. Then, use the pdftk cat operation to combine the PDFs:
pdftk A=my-file.pdf B=page-8.pdf cat A1-7 B A9 output output-file.pdf
Related
I have some pdf's with 2-3 passages for every page. every passage is separated by some line gap, but while reading with pymupdf, I cannot see any machine printable separator between passages. is there any other way, other library can do this?
code:
import fitz
from more_itertools import *
doc = fitz.open('IT_past.pdf',)
single_doc = doc.load_page(0) # put here the page number
text=single_doc.get_text('text')
text
page screen shot:
enter image description here
pdf
Full pdf
There is no gap as such, just for the moment as its much easier, lets look closer in your linked viewer rendering :-
So lets replicate what is inside the real PDF (that has no web side html <p> markers) :-
support, product design, HR Management, knowledge process outsourcing for
pharmaceutical companies and large complex projects.
Software exports make up 20 % of India's total export revenue in 2003-04, up from 4.9 %
in 1997.This figure is expected to go up to 44% of annual exports by 2010. Though India
See there is "no gap" just left aligned non justified (ragged) text that needs a style such as a font name and stretched out locations added to hold in a page de-void of line feeds nor true carriage returns. (occasionally there are some backspace or vertical/horizontal moves but generally meaningless in line printer text). Even "Tabs" "Indents" and some spatial characters are normally discarded in a PDF printout.
If you want gaps or line-wrap you need to add them.
A good alternative is export the -layout using poppler or xpdf here to - (console) or pipe it or replace that with a path/name.txt, many other options available like -nopgbrk
xpdf-tools-win-4.04\bin32>pdftotext -f 1 -l 1 -layout IT_past.pdf -
I have two pdf documents, both in same layout with different information. The problem is:
I can read one perfectly but the other one the data is unrecognizable.
This is an example which I can read perfectly, download here:
from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe as expected:
This is an example which after I read, the information is unrecognizable, download here:
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe with unrecognizable information:
I don't understand what I have done wrong and why the same code doesn't work for both files. I need some help, thanks.
The problem: malformed PDF
Simply, the problem is that your second PDF is malformed / corrupted. It doesn't contain correct font information, so it is impossible to extract text from your PDF as is. It is a known and difficult problem (see this question).
You can check this by trying to open the PDF with Google Docs.
Google Docs tries to extract the text and this is the result:.
Possible solutions
If you want to extract the text, you can print the document to an image-based PDF and perform an OCR text extraction.
However, Camelot does not currently support image-based PDFs, so it is not possible to extract the table.
If you have no way to recover a well-formed PDF, you could try this strategy:
print PDF to an image-based PDF
add a good text layer to your image-based PDF (using OCRmyPDF)
try using Camelot to extract tables
I am referring to this Suppress automatic figure numbering in pdf output with r markdown/knitr
which I don't think was answered fully.
Essentially, I am using knitr::spin and rmarkdown to produce word, pdf and html documents.
For word, there appears to be no numbering when one puts in
+fig.1, fig.cap = "Figure name"
You only get an output Figure name in the caption.
To solve that, I used captioner class.
figs = captioner("Figure")
That works fine for word
But I am not faced with rewriting the script for pdf document as the caption turns up as figure 1: figure 1: The name
I am using knitr::spin to actually generate the RMD document for forward outputs in word and pdf.
I am not sure I can use hooks in knitr::spin, as I have tried it as advertised but can't get it to work.
I also tried
header-includes: \usepackage{caption} \usepackage{float}
\captionsetup[table]{labelformat=empty}
\captionsetup[figure]{labelformat=empty}
as suggested somehere to surpress the prefix for pdf but I get errors from pandoc. It uses pdf2latex.
I am not sure how one would query the output format in knitr::spin to actually produce different actions for different formats which could be a solution although cumbersome.
Thank you so much for your help from a novice.
This may be a stupid question, but I can't figure it out.
I have made some tables in SPSS. Now I want them over to my latex document.
What I do, it that I right-click the table in SPSS, and press export.
Here I can choose between PDF or .doc. BUT the PDF-file created, generates a file with the table on top of a page (A4 size, with "page 1" at the bottom). I do not want this, I only want the table.
example how it turns out:
Example how I want it to turn out:
If I export to word, I can further save as PDF, but same problem occurs.
Screenshot works, but does not give me the same picture-quality that I prefer.
Do anyone of you have any tips for me?
Thanks :)
Unfortunately SPSS does not provide native table export to Latex. It does provide table export to html and xls, which can post-hoc be converted to Tex tables. PDF output for everything forces to export the full page (very annoying for graphics as well) - but you probably don't want to insert the image of the table (you could crop the PDF if need be), but have a Tex table (in the same font) as your document anyway.
One thing I have done in the past to make the export to text tables with specific markup is to use the PRINT or LIST commands to print the text table to the output (or to a text file) that is closer to the end goal. In this NABBLE post I have some syntax that makes pandoc flavored pipe style markdown tables - it should be pretty clear how that same approach could be used for Tex tables (actually Tex tables should be much simpler).
Here is an example of some code using LIST to make a the markup closer to Tex tables.
DATA LIST FREE / Variable (A1) Mean Median (2F4.2).
BEGIN DATA
A 3.25 2.00
B 2.56 2.50
C 9.87 10.20
END DATA.
*Using LIST to make Latex style table.
STRING Mid (A1) End (A2).
COMPUTE Mid = "&".
COMPUTE End = "//".
LIST /VARIABLES = Variable Mid Mean Mid Median End.
And here is a screen shot of the produced output on my machine.
So here I would still have to copy-paste the text output into my Tex document, (and make the header row).
You can also use OMS to save designed items in a variety of formats, including XML and then use an xml-to-Latex tool such as xmltex. You could probably even generate such a conversion with XSLT from the XML.
From the Viewer, you could also retrieve the table with Python scripting and use a Python-based converter tool.
I am trying to extract a range of pages from a multipage pdf file into individual jpegs using convert (Imagemagick). The extraction works fine. What I am stuck on is that if I want to extract page range 10-20, I still get out jpeg files with names page-0.jpeg to page-9.jpeg while I want them to be named page-10.jpeg to page-20.jpeg. Is there a way of specifying that on the command line?
I require this since I want to extract pages in chucks of 10 to avoid eating up too much memory for huge pdf files and don't want to keep renaming the files.
I remember having this working in an earlier project but can't figure out what I am missing now.
Finally managed to do this. Leaving a answer in case somebody else is looking for the same. The solution works with Imagemagick 6.5.1.
So we want to extract page numbered i to j from a.pdf into individual jpegs with files named from a-10.jpeg to a-20.jpeg.
convert a.pdf[i-j] -set filename:page "%[fx:t+i]" a-%[filename:page].jpeg
This uses fx operators. fx:t gives the screen number of current image in sequence and we can add our offset to it.
You can specify the first "page" number used by %d in the output filename by adding the -scene n parameter, e.g.:
convert a.pdf[0-9] -scene 10 a-%d.jpeg
will output a-10.jpeg, a-11.jpeg, etc.