Camelot in python does not behave as expected - python-camelot

I have two PDF documents with the same layout but different information. The problem is that
I can read one perfectly, but the data extracted from the other is unrecognizable.
Here is an example that I can read perfectly (download here):
import camelot

from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe as expected:
Here is an example where the information is unrecognizable after reading (download here):
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe with unrecognizable information:
I don't understand what I have done wrong and why the same code doesn't work for both files. I need some help, thanks.

The problem: malformed PDF
Put simply, your second PDF is malformed or corrupted. It doesn't contain correct font information, so it is impossible to extract text from it as is. This is a known and difficult problem (see this question).
You can check this by trying to open the PDF with Google Docs.
Google Docs tries to extract the text and this is the result:
Possible solutions
If you want to extract the text, you can print the document to an image-based PDF and perform an OCR text extraction.
However, Camelot does not currently support image-based PDFs, so it is not possible to extract the table.
If you have no way to recover a well-formed PDF, you could try this strategy:
1. Print the PDF to an image-based PDF.
2. Add a good text layer to your image-based PDF (using OCRmyPDF).
3. Try using Camelot to extract the tables.
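The strategy above can be sketched roughly as follows. This is only a sketch under assumptions: it uses OCRmyPDF's `--force-ocr` option, which rasterizes each page and runs OCR on it, as a stand-in for the first two steps, and it assumes the `ocrmypdf` command and the `camelot` package are installed. The function names and file names are hypothetical, and the quality of the tables depends entirely on how well the OCR goes.

```python
import subprocess


def build_ocr_command(src_pdf, dst_pdf):
    """Build an OCRmyPDF command line that rasterizes and re-OCRs every page."""
    # --force-ocr ignores the existing (broken) text and OCRs the page image instead.
    return ["ocrmypdf", "--force-ocr", src_pdf, dst_pdf]


def extract_tables_after_ocr(src_pdf, dst_pdf):
    """Run OCRmyPDF on src_pdf, then let Camelot read the rewritten file."""
    import camelot  # imported lazily so the command builder works without Camelot

    subprocess.run(build_ocr_command(src_pdf, dst_pdf), check=True)
    return camelot.read_pdf(dst_pdf, flavor="stream")
```

A call such as `extract_tables_after_ocr('2020_04_2.pdf', '2020_04_2_ocr.pdf')` would then return a table list to inspect in the same way as in the question.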

Related

How can I replace a single PDF page using ImageMagick?

I have a multi-page PDF document, and I want a version of the document in which one of the pages in the middle is replaced with a scanned-in copy, as I needed a physical signature in the document. How can I do this with ImageMagick? I'm aware that ImageMagick may not necessarily be the best tool for the job. However, the resulting PDF does not need to be high quality or a high fidelity copy, so it should be sufficient for my needs.
As a specific example, I have a 9 page my-file.pdf, and I want to create a copy of the PDF with the 8th page replaced with page-8.png. It looks like I should be able to achieve this goal with the convert tool, though it's not immediately obvious what the syntax would be. How can I achieve this goal?
If I merely wanted to append the new page to the end of the file, I know I can do the following:
convert my-file.pdf page-8.png output-file.pdf
However, this ends up with the original pages 1-9, followed by the new page 8. What I actually want is to replace the original page 8 with the new page 8. My desired output is:
[original pages 1 - 7],[new page 8],[original page 9]
A specific page or range of pages can be specified using the bracket syntax with zero-based indexing. For instance, [8] will refer to the ninth page, and [0-6] to the first seven pages. Using this, a duplicate of the PDF with the 8th page replaced can be achieved as follows:
convert my-file.pdf[0-6] page-8.png my-file.pdf[8] output-file.pdf
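For other page numbers the only tricky part is the off-by-one between the one-based page you want to replace and ImageMagick's zero-based bracket indices. A minimal sketch of that arithmetic, using a hypothetical helper name and assuming you already know the total page count:

```python
def convert_args(src_pdf, image, page, page_count, out_pdf):
    """Return `convert` arguments that replace the 1-based `page` with `image`."""
    if not 1 <= page <= page_count:
        raise ValueError("page is out of range")

    def pages(first, last):
        # Zero-based bracket selector: a single index or an inclusive range.
        index = str(first) if first == last else f"{first}-{last}"
        return f"{src_pdf}[{index}]"

    args = ["convert"]
    if page > 1:
        args.append(pages(0, page - 2))  # everything before the replaced page
    args.append(image)
    if page < page_count:
        args.append(pages(page, page_count - 1))  # everything after it
    args.append(out_pdf)
    return args
```

For the example in the question, `convert_args('my-file.pdf', 'page-8.png', 8, 9, 'output-file.pdf')` produces the same arguments as the command above, and the list can be passed to `subprocess.run`.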
As indicated in the question, as well as in a comment, ImageMagick is not a great tool for this task, as it rasterizes every page of the output PDF. An alternative that avoids this is to use pdftk to replace the page. While the inserted page will still be an image, the remaining original pages will not be rasterized.
First, save the scanned-in page as a PDF using an application capable of this. Then, use the pdftk cat operation to combine the PDFs:
pdftk A=my-file.pdf B=page-8.pdf cat A1-7 B A9 output output-file.pdf
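Unlike ImageMagick, pdftk page ranges are one-based, so the ranges line up with the page numbers directly. A small sketch that builds the same command for an arbitrary page, again with a hypothetical helper name and an assumed known page count:

```python
def pdftk_args(src_pdf, page_pdf, page, page_count, out_pdf):
    """Return pdftk arguments that replace the 1-based `page` of src_pdf."""
    if not 1 <= page <= page_count:
        raise ValueError("page is out of range")

    def pages(first, last):
        # One-based range on handle A: a single page or an inclusive range.
        return f"A{first}" if first == last else f"A{first}-{last}"

    ranges = []
    if page > 1:
        ranges.append(pages(1, page - 1))
    ranges.append("B")  # the replacement page, given as a one-page PDF
    if page < page_count:
        ranges.append(pages(page + 1, page_count))
    return ["pdftk", f"A={src_pdf}", f"B={page_pdf}", "cat", *ranges, "output", out_pdf]
```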

Convert pdf to docx without changing font-size, bold,etc in python3.x

I want to convert PDF to docx using Python 3.x on Ubuntu 16.x. I went through the code given below:
import os
import subprocess

# pdfdir, docdir, lowriter, outfilter and the counter i are assumed to be defined earlier.
for top, dirs, files in os.walk(pdfdir):
    for filename in files:
        if filename.endswith('.pdf'):
            i = i + 1
            abspath_pdf = os.path.normpath(os.path.join(top, filename))
            # print is a function in Python 3, so it needs parentheses
            print('Converting {0} into .doc format..'.format(abspath_pdf))
            subprocess.call('{0} --invisible --convert-to doc{1} --outdir "{2}" "{3}"'
                            .format(lowriter, outfilter, docdir, abspath_pdf), shell=True)
But it's not working for me. Can anyone help with this?
Thanks in advance.
You can use Aspose.Words Cloud to convert PDF to MS Word formats:
https://products.aspose.cloud/words/python
You should also note that PDF is a fixed-page format, whereas MS Word formats are flow formats. This makes conversion from PDF to MS Word quite a difficult task. Aspose.Words Cloud recognizes the elements in the PDF, so the output is editable in MS Word. See the following link to learn more about PDF to Word conversion: https://docs.aspose.cloud/display/wordscloud/Convert+PDF+Document+to+Word

Knitr Spin and Rmarkdown Fig.cap (figure caption). Producing double numbering pdf document

I am referring to the question "Suppress automatic figure numbering in pdf output with r markdown/knitr",
which I don't think was answered fully.
Essentially, I am using knitr::spin and rmarkdown to produce Word, PDF and HTML documents.
For Word, there appears to be no numbering when one puts in
#+ fig.1, fig.cap = "Figure name"
You only get "Figure name" as the caption in the output.
To solve that, I used the captioner package:
figs = captioner("Figure")
That works fine for Word.
But I am now faced with rewriting the script for the PDF document, as the caption turns up as "Figure 1: Figure 1: The name".
I am using knitr::spin to generate the Rmd document that is then rendered to Word and PDF.
I am not sure I can use hooks in knitr::spin; I have tried them as advertised but can't get them to work.
I also tried
header-includes: \usepackage{caption} \usepackage{float}
\captionsetup[table]{labelformat=empty}
\captionsetup[figure]{labelformat=empty}
as suggested somewhere to suppress the prefix for PDF, but I get errors from pandoc. It uses pdflatex.
I am not sure how one would query the output format in knitr::spin to take different actions for different formats, which could be a solution, although a cumbersome one.
Thank you so much for your help, from a novice.

Extract sections of PDF

I am trying to extract sections of a PDF file for use in text analysis. I tried using pdf-extract to accomplish this. However, a command such as
pdf-extract extract --regions --no-lines Bauer2010.pdf
only extracts the (x, y) coordinates of a region, as in the example below.
<region x="226.32" y="750.47" width="165.57" height="6.37"
line_height="6.37" font="BGBFHO+AdvP4DF60E">Patient Education and
Counseling 79 (2010) 315-319</region>
Can sections of a PDF be extracted?
Have a look at http://text-analyzer.com, where you can upload your PDF file and have it converted into a format suitable for natural language processing. Once the file has been converted to text, the site can process it, breaking it down into sentences with sentiment analysis. It has over 40 different types of sentence views in which you can tag sections. Those tagged sentences can be exported.
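Note also that the region output shown in the question already carries the text of each region alongside its coordinates, so one low-tech option is to parse that XML and keep only the text. The sketch below uses Python's standard library. The surrounding `<pdf>` root element is an assumption made so the fragment parses on its own, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The region from the question, wrapped in an assumed root element.
SAMPLE = """<pdf>
<region x="226.32" y="750.47" width="165.57" height="6.37"
line_height="6.37" font="BGBFHO+AdvP4DF60E">Patient Education and
Counseling 79 (2010) 315-319</region>
</pdf>"""


def region_texts(xml_text):
    """Return the text of every <region> element with whitespace collapsed."""
    root = ET.fromstring(xml_text)
    return [" ".join((region.text or "").split()) for region in root.iter("region")]


print(region_texts(SAMPLE))
```

The coordinates and font attributes remain available through `region.attrib` if you want to group regions into sections, for example by font or vertical position.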

SPSS tables to latex (PDF) without creating an A4 page

This may be a stupid question, but I can't figure it out.
I have made some tables in SPSS. Now I want to get them into my LaTeX document.
What I do is right-click the table in SPSS and press Export.
Here I can choose between PDF and .doc. But the PDF file that is created has the table at the top of a page (A4 size, with "page 1" at the bottom). I do not want this; I only want the table.
Example of how it turns out:
Example of how I want it to turn out:
If I export to Word, I can then save as PDF, but the same problem occurs.
A screenshot works, but does not give me the picture quality I would like.
Does anyone have any tips for me?
Thanks :)
Unfortunately, SPSS does not provide native table export to LaTeX. It does provide table export to HTML and XLS, which can be converted to TeX tables afterwards. PDF output always exports the full page (very annoying for graphics as well), but you probably don't want to insert an image of the table anyway (you could crop the PDF if need be); you would rather have a TeX table in the same font as your document.
One thing I have done in the past to export text tables with specific markup is to use the PRINT or LIST commands to print the table to the output (or to a text file) in a form that is closer to the end goal. In this Nabble post I have some syntax that makes pandoc-flavored pipe-style markdown tables - it should be pretty clear how the same approach could be used for TeX tables (actually, TeX tables should be much simpler).
Here is an example of some code using LIST to make the markup closer to TeX tables.
DATA LIST FREE / Variable (A1) Mean Median (2F4.2).
BEGIN DATA
A 3.25 2.00
B 2.56 2.50
C 9.87 10.20
END DATA.
*Using LIST to make Latex style table.
STRING Mid (A1) End (A2).
COMPUTE Mid = "&".
COMPUTE End = "\\".
LIST /VARIABLES = Variable Mid Mean Mid Median End.
And here is a screen shot of the produced output on my machine.
So here I would still have to copy-paste the text output into my TeX document (and make the header row).
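If the values can be exported from SPSS in some plain form first, the same row-building idea (cells joined by `&`, each row ended by `\\`) is easy to script outside SPSS, header row included. A minimal Python sketch with a hypothetical helper name, using the three rows from the example data typed in by hand:

```python
def latex_rows(header, rows):
    """Format a header and (name, mean, median) rows as LaTeX tabular lines."""
    lines = [" & ".join(header) + r" \\"]
    for name, mean, median in rows:
        # Two decimals, matching the F4.2 format used in the SPSS example.
        lines.append(f"{name} & {mean:.2f} & {median:.2f}" + r" \\")
    return lines


data = [("A", 3.25, 2.00), ("B", 2.56, 2.50), ("C", 9.87, 10.20)]
for line in latex_rows(["Variable", "Mean", "Median"], data):
    print(line)
```

The printed lines can be pasted between `\begin{tabular}{lrr}` and `\end{tabular}` in the document.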
You can also use OMS to save designed items in a variety of formats, including XML and then use an xml-to-Latex tool such as xmltex. You could probably even generate such a conversion with XSLT from the XML.
From the Viewer, you could also retrieve the table with Python scripting and use a Python-based converter tool.