I'm trying to convert a rather long docx file to a pdf via pandoc, and was wondering if there's any way to get pandoc to automatically use multi-line tables in the output pdf. I know that there's a pandoc multiline table markdown extension, so would there be any way to convert a docx file into a multiline-markdown-compliant format, or to convert it directly to a pdf with such tables?
On Windows you would normally print the DOCX to a PDF printer.
That is the easiest and most precise way to convert from DOCX to PDF.
I wanted to have RST as the main format.
Therefore I converted some DOCX to restructuredText (RST) with pandoc.
Table entries were sometimes split across several lines where the original had a single line,
breaking words.
I converted them to list-table with this script: listtable.py
(A 0 in the join parameter string joins the lines in that position, i.e. column, without a space.)
From RST to HTML, PDF, DOCX I use this script:
dcx.py
In the Makefile you will see the pandoc command to convert to pdf.
I doubt that a direct conversion using pandoc would give good results.
Use an intermediate (lightweight) markup text format and edit that until the conversion to pdf meets your expectations.
This only makes sense if you want a plain-text version anyway.
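The two-step route described above (docx to a lightweight markup file, hand-editing, then markup to pdf) could be sketched roughly as follows. This is only a sketch under assumptions: the file names are hypothetical, the format string disables pandoc's other Markdown table extensions so that multiline tables are the preferred table syntax, and pandoc may still fall back to another representation for tables that multiline syntax cannot express. Producing a pdf also assumes a working PDF engine (for example LaTeX) is installed.

```python
import subprocess

# Markdown variant with the other table syntaxes switched off, so that pandoc
# prefers multiline tables when writing (assumption: the tables are simple
# enough to be expressed that way).
MULTILINE_MD = "markdown-simple_tables-pipe_tables-grid_tables+multiline_tables"


def docx_to_markdown_cmd(docx_path, md_path):
    """Build the pandoc command for step 1: docx -> editable Markdown."""
    return ["pandoc", docx_path, "-f", "docx", "-t", MULTILINE_MD, "-o", md_path]


def markdown_to_pdf_cmd(md_path, pdf_path):
    """Build the pandoc command for step 2: edited Markdown -> pdf."""
    return ["pandoc", md_path, "-f", "markdown", "-o", pdf_path]


def run_step(cmd):
    """Run one pandoc step and raise if pandoc reports an error."""
    subprocess.run(cmd, check=True)
```

A possible use is to call run_step(docx_to_markdown_cmd("report.docx", "report.md")), edit report.md by hand until the tables look right, and then call run_step(markdown_to_pdf_cmd("report.md", "report.pdf")).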
I have two pdf documents with the same layout but different information. The problem is:
I can read one perfectly, but the data from the other one is unrecognizable.
This is an example which I can read perfectly, download here:
import camelot

from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe as expected:
This is an example where the information is unrecognizable after reading, download here:
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe with unrecognizable information:
I don't understand what I have done wrong and why the same code doesn't work for both files. I need some help, thanks.
The problem: malformed PDF
Simply, the problem is that your second PDF is malformed / corrupted. It doesn't contain correct font information, so it is impossible to extract text from your PDF as is. It is a known and difficult problem (see this question).
You can check this by trying to open the PDF with Google Docs.
Google Docs tries to extract the text and this is the result:
Possible solutions
If you want to extract the text, you can print the document to an image-based PDF and perform an OCR text extraction.
However, Camelot does not currently support image-based PDFs, so it is not possible to extract the table.
If you have no way to recover a well-formed PDF, you could try this strategy:
print PDF to an image-based PDF
add a good text layer to your image-based PDF (using OCRmyPDF)
try using Camelot to extract tables
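The strategy above could be sketched roughly like this. It is a sketch under assumptions, not a tested recipe: it uses the ocrmypdf command-line option --force-ocr, which rasterizes each page and adds a fresh OCR text layer, as a stand-in for the separate "print to image" and "add text layer" steps, and the function names are hypothetical. Whether Camelot then finds usable tables depends on the OCR quality.

```python
import shutil
import subprocess


def ocr_command(src_pdf, dst_pdf):
    """Build the ocrmypdf call that discards the broken text layer.

    --force-ocr rasterizes every page and runs OCR on the result, so the
    malformed font information in the source no longer matters.
    """
    return ["ocrmypdf", "--force-ocr", src_pdf, dst_pdf]


def extract_tables(src_pdf, dst_pdf):
    """Re-OCR src_pdf into dst_pdf, then try Camelot on the new file."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(ocr_command(src_pdf, dst_pdf), check=True)

    # Imported here so the helper above can be used without Camelot installed.
    import camelot

    # 'stream' matches the flavor used in the question.
    return camelot.read_pdf(dst_pdf, flavor="stream")
```

With the file from the question, a call might look like extract_tables('2020_04_2.pdf', '2020_04_2_ocr.pdf'), where the output name is only an illustration.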
I'm using pandoc to convert a docx file to pdf.
Command I'm using for converting:
pandoc -o document.pdf document.docx
The problem is that after the conversion, the styling disappears.
As you can see, the colors are missing and my table has moved to the right.
Pandoc does not convert styles/layout/formatting.
From https://pandoc.org/MANUAL.html:
Because pandoc’s intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size.
I want to convert PDF to docx using python3.x on ubuntu16.x. I went through the code given below:
import os
import subprocess

# pdfdir, docdir, lowriter and outfilter are assumed to be defined earlier
i = 0  # number of PDF files found so far
for top, dirs, files in os.walk(pdfdir):
    for filename in files:
        if filename.endswith('.pdf'):
            i = i + 1
            abspath_pdf = os.path.normpath(os.path.join(top, filename))
            # print is a function in Python 3
            print('Converting {0} into .doc format..'.format(abspath_pdf))
            subprocess.call('{0} --invisible --convert-to doc{1} --outdir "{2}" "{3}"'
                            .format(lowriter, outfilter, docdir, abspath_pdf), shell=True)
But it's not working for me. Can anyone help with this?
Thanks in advance.
You can use Aspose.Words Cloud to convert PDF to MS Word formats:
https://products.aspose.cloud/words/python
You should also note that PDF is a fixed-page format, whereas MS Word formats are flow formats. This makes conversion from PDF to MS Word quite a difficult task. Aspose.Words Cloud recognizes the elements in the PDF, so the output is editable in MS Word. See the following link to learn more about PDF to Word conversion: https://docs.aspose.cloud/display/wordscloud/Convert+PDF+Document+to+Word
I'm parsing a PDF file. I decoded all streams and got the text from the text objects and the ToUnicode CMaps. But I don't know when I need to replace symbols from the text with symbols (strings) from the ToUnicode CMaps.
When I see something like <01239099>, I use this conversion table and all is OK. But some files require that I use the conversion table while parsing simple text like
[(.&)-2(.K)-5(.D)-8(.S)], and then all is OK too.
Does somebody know the rule for which symbols need to be replaced using the conversion table?
I'm using pdftk within my application to generate a composite pdf from several individual pdf files. Sometimes, for whatever reason, pdftk does not run.
I feel that the reason is the number of input files, e.g. 586, or the character length of the input pdf files, in my case 35624.
What are the limits for input files in pdftk?
Thank you for your answer.