Extract table data from PDF [closed] - pdf

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
Is there any consistent way to extract tables from PDF files? Any tools?
What I have done so far:
I have tried the pdftotext tool. It has an option to convert to HTML layout.
What is the problem with this:
The table information is not preserved in HTML output
I expected <table> tags, but everything was under <p> tags.
Will there be any markers in a PDF document to indicate table structures? Like <table>, <tr> and <td> in HTML?
If "yes", any pointers to this would be helpful. If "no", definitive information about this fact would also be helpful.

What you could do, however, is run pdftotext -layout input.pdf output.txt.
It writes the PDF's text to a text file and preserves the original layout. There are no tags, but with a bit of nifty scripting (Perl / PHP / whatever), you can recover the data from the tables.
If you're working on a single page, you're probably better off doing it manually, but if you (like me) have to work on hundreds or thousands of pages, it's about the best you can get.
I've been looking around for a long time and can't find any better pdf-2-text tool than pdftotext.
There is a bit of inconsistency in the output, not all similar pdf tables produce a similar looking txt output, but that makes your scripting a little more interesting.
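As a rough illustration of the kind of scripting meant here, the following Python sketch splits the output of pdftotext -layout into cells. It assumes that columns are separated by runs of at least two spaces and that single spaces only occur inside a cell; the function names and the sample text are made up for the example, and real output will often need per-document tuning.

```python
import re


def split_layout_row(line, min_gap=2):
    """Split one line of layout-preserving text into cells.

    Assumption: a run of at least `min_gap` spaces separates two columns.
    """
    pattern = r" {%d,}" % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]


def parse_layout_table(text, min_columns=2):
    """Keep only the lines that split into at least `min_columns` cells."""
    rows = []
    for line in text.splitlines():
        cells = split_layout_row(line)
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


# Hypothetical sample of what `pdftotext -layout` might produce.
sample = (
    "Quarterly report\n"
    "\n"
    "Item            Qty     Unit price\n"
    "Blue widget       3          4.50\n"
    "Red widget       12          1.25\n"
)

table = parse_layout_table(sample)
for row in table:
    print(row)
```

Splitting on whitespace runs loses track of empty cells, since a missing value simply shifts the following cells to the left. For tables with gaps, slicing each line at fixed character offsets taken from the header row is usually the more robust choice.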

If the PDF document lacks the information that marks content as a table, row, cell, etc. (known as tags), then there is no consistent way to extract tables from it. Most PDF documents do not contain these tags. They typically serve to make a PDF accessible, so that it can, for example, be read aloud, and they are not required for a PDF to be valid.
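To get a first idea of whether a given file carries such tags, one can look for the /StructTreeRoot entry that a tagged PDF stores in its document catalog. The Python sketch below is only a heuristic: it searches the raw bytes, so it assumes the catalog is not stored inside a compressed object stream, and it will report False for tagged files where it is. The function name is made up for the example.

```python
def looks_tagged(pdf_bytes):
    """Heuristic check for a structure tree in raw PDF bytes.

    Assumption: the document catalog is stored uncompressed, so the
    /StructTreeRoot key is visible in the file as plain text.
    """
    return b"/StructTreeRoot" in pdf_bytes


# Hypothetical catalog fragments, not complete PDF files.
tagged_fragment = b"1 0 obj << /Type /Catalog /StructTreeRoot 5 0 R >> endobj"
untagged_fragment = b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj"

print(looks_tagged(tagged_fragment))
print(looks_tagged(untagged_fragment))
```

For a real file, read it in binary mode and pass the bytes to the function. A True result only says that a structure tree exists, not that the tables in it are tagged correctly.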

Related

What documentation system to use for user guide [closed]

Closed 4 years ago.
I'm currently trying to find a documentation (user guide) system that has the following features:
documentation files in text mode (so svn can diff/merge them)
support for images, tables, cross-references and a table of contents
export to PDF (or .doc/.odt) with working cross-references
I tried markdown for documentation source files and pandoc for pdf export but markdown does not support tables.
I really appreciate any help you can provide.
We use Sphinx for this scenario.
It can generate HTML, PDF and some other formats from reStructuredText files.
And have a look at list-table when you want to add complex tables.
I use TeX for electronic and printable documentation:
https://tex.stackexchange.com/
https://en.wikipedia.org/wiki/TeX
Probably the most commonly used solution set for documentation is XML in DocBook or DITA. You can certainly manage those in SVN as well as perform diffs. They both provide processing toolchains with many output types, including PDF through XSL-FO.

Tools to display code in PDF [closed]

Closed 2 years ago.
I'm looking for a tool to display code examples in PDF file. I mean that I would like to colorize and indent code in my PDF (it's for lessons).
I'm not able to find anything on the web or on StackOverflow. It's full of tutorials to use code to make PDF but not to display code in PDF. When I search for 'display' it gives me how to display PDF in web/applications.
Sorry to disappoint you, but:
There is no such thing as you are looking for!
If you want code samples on a PDF page to be syntax highlighted, you must look for a tool that does this in the source document from which the PDF file was generated.
There is no tool in the world, neither Free and Open Source Software, nor commercial payware, that lets you edit a PDF and convert the source samples on its pages into properly syntax highlighted parts. (The only thing you can possibly do on this level is adding specific comments -- here you have to manually highlight specific words or sentences with a background color of your choice.)
If you are looking for a toolchain that makes it easy to generate PDFs from scratch containing syntax-highlighted code samples, look at:
Markdown: a very lean text markup language to write the document in (use any text editor you like)
Pandoc: a powerful Markdown-to-Anything converter. It's a command line tool available for all major OS platforms. Its output may be PDF, HTML, EPUB, LaTeX (all of the previous with syntax highlighting), as well as ODT, DOCX, DocBook (no syntax highlighting supported so far for the last few) and a few more...
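A minimal Python sketch of this toolchain: it writes a Markdown file containing one fenced code block tagged with its language, then hands the file to Pandoc. The file names and helper functions are made up for the example. It assumes pandoc is on the PATH and that a PDF engine such as a LaTeX distribution is installed, which Pandoc needs for PDF output.

```python
import subprocess

FENCE = "`" * 3  # Markdown code fence, built this way to keep the listing readable


def lesson_markdown(title, language, code):
    """Return a Markdown document with one fenced, language-tagged code block."""
    return f"# {title}\n\n{FENCE}{language}\n{code.rstrip()}\n{FENCE}\n"


def pandoc_command(source_path, pdf_path):
    """Build the basic invocation: pandoc SOURCE -o TARGET."""
    return ["pandoc", source_path, "-o", pdf_path]


def build_pdf(markdown, source_path, pdf_path):
    """Write the Markdown source and run pandoc on it (needs pandoc installed)."""
    with open(source_path, "w", encoding="utf-8") as handle:
        handle.write(markdown)
    subprocess.run(pandoc_command(source_path, pdf_path), check=True)


markdown = lesson_markdown("Lesson 1", "python", "print('hello')\n")
command = pandoc_command("lesson1.md", "lesson1.pdf")
print(markdown)
print(" ".join(command))
```

Calling build_pdf(markdown, "lesson1.md", "lesson1.pdf") would produce the PDF. The language name after the opening fence is what tells Pandoc which highlighting rules to apply to the block.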

Markdown for automatic doc generation? [closed]

Closed 5 years ago.
I've used Javadoc, as well as a variety of different XML-based doc-generation systems. Javadoc is fine; XML-based doc generators are hideous, with the XML getting all over the comments and turning the comments into soup.
I've looked at markdown, and the fact that it is easily parseable into structured data but also super human-readable would make it perfect for in-code comments, where the readability of both the docs and the plaintext is of utmost importance.
Are there any markdown based doc-generators out there already? Is there any reason why it wouldn't work which I don't know of?
There exist some Markdown doclets (e.g. http://www.richardnichols.net/open-source/markdown-doclet/ ) which can be used with Javadoc.
Maybe you are also interested in the famous doxygen tool. It doesn't use Markdown but the format is very similar to it (e.g. unordered lists with -, etc.).
You may try mdoc to generate Markdown based documentation. It reads all the .md files and produces HTML based documentation. It also creates a TOC. Check it out.

Free library to read PDF files [closed]

Closed 7 years ago.
Is there a free way to read PDF files through VBA to extract basic text content? I need to automate a weekly data acquisition process at my company where data is contained in PDF files (which are updated weekly by the data provider). Also, is there a reference I can look into to understand the file structure (DOM?) of a PDF?
Adobe's PDF reference is online here: http://www.adobe.com/devnet/pdf/pdf_reference.html
I'm not sure about the best way to read PDFs from VBA directly, but if you can call an external Java or C# program, then I would recommend using iText for basic text extraction.
EDIT: I should maybe mention that Adobe's PDF reference is an 800-page beast. I found that it's good for looking up answers to particular questions (e.g. storing widths of embedded TrueType fonts), but it may not be a good place to start. For that, reading through the iText book helped me to get started on the format.
The iText book contains lots of worked examples for general PDF tasks and lots of background info to help you understand PDF files. It more than pays for itself very quickly!

Programmatically extracting Adobe PDF package files [closed]

Closed 7 years ago.
We have a bunch of documents in our organization that were inadvertently saved as Adobe PDF packages (also known as PDF 1.7 "collections").
We would like to convert these to normal PDFs (most of these "packages" contain one bog-standard pdf file), but given the number of files, it's not possible manually.
Any Adobe expert know whether:
There is an open-source or free library that handles PDF package format that I can write a script around?
Does Adobe Pro 9 have a relevant scriptable interface that would allow me to extract the relevant file from each package?
Alternatively, I'm looking at a macro-based approach, but I'd rather not go this route until investigating other options.
Thanks!
After a bunch of digging around, I found pdftk, which is distributed as source and binary on many platforms.
It does almost all of what we need to do, and we can now iterate through our documents and recursively call pdftk on each (some are multi-level attachment chains).
Note pdftk will only burst pages of the visible document into individual documents. The hidden documents remain hidden.
The option you need to use is unpack_files.
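The iteration over multi-level attachment chains could look roughly like the Python sketch below. It relies on the pdftk form pdftk PACKAGE unpack_files output DIRECTORY; everything else (function names, the "_attachments" directory suffix, the depth limit) is an assumption made for the example, and the sketch has not been tested against real PDF packages.

```python
import subprocess
from pathlib import Path


def unpack_command(package_path, output_dir):
    """Build: pdftk PACKAGE unpack_files output OUTPUT_DIR."""
    return ["pdftk", str(package_path), "unpack_files", "output", str(output_dir)]


def unpack_recursively(package_path, output_dir, run=subprocess.run, max_depth=5):
    """Unpack the attachments of a PDF, then unpack any PDFs found among them.

    `run` can be replaced for dry runs; `max_depth` guards against
    unexpectedly deep attachment chains.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    run(unpack_command(package_path, output_dir), check=True)
    if max_depth <= 1:
        return
    # Note: this glob is case sensitive on most non-Windows systems.
    for attachment in sorted(output_dir.glob("*.pdf")):
        nested_dir = output_dir / (attachment.stem + "_attachments")
        unpack_recursively(attachment, nested_dir, run=run, max_depth=max_depth - 1)


print(" ".join(unpack_command("package.pdf", "extracted")))
```

Calling unpack_recursively("package.pdf", "extracted") requires pdftk to be installed. Extracted PDFs that carry no attachments of their own simply leave an empty "_attachments" directory behind, which a cleanup pass can remove afterwards.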
Yet another unwanted obfuscation format to hinder interoperability therefore classified as malware.
Using Adobe Acrobat Professional, combine all into one PDF and then split by bookmark level.
I understand this thread is a few years old, but if anyone is looking for a free utility to extract files from PDF packages (especially from large collections), then check out the free utility ByteScout PDF Multitool: it was tested against 500+ MB package files to extract hundreds of multilevel chained attachments.
Disclaimer: I'm affiliated with ByteScout