Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
I'm looking for a tool to display code examples in a PDF file. I mean that I would like to colorize and indent the code in my PDF (it's for lessons).
I'm not able to find anything on the web or on Stack Overflow. There are plenty of tutorials on using code to generate a PDF, but none on displaying code in a PDF. When I search for 'display', I get results on how to display a PDF in web pages or applications.
Sorry to disappoint you, but:
There is no such thing as you are looking for!
If you want code samples on a PDF page to be syntax highlighted, you must look for a tool that does this in the source document from which the PDF file is generated.
There is no tool in the world, neither Free and Open Source Software, nor commercial payware, that lets you edit a PDF and convert the source samples on its pages into properly syntax highlighted parts. (The only thing you can possibly do on this level is adding specific comments -- here you have to manually highlight specific words or sentences with a background color of your choice.)
If you are looking for a toolchain that makes it easy to generate PDFs from scratch containing syntax-highlighted code samples, look at:
Markdown: a very lean text markup language to write the document in (use any text editor you like)
Pandoc: a powerful Markdown-to-anything converter. It's a command-line tool available for all major OS platforms. Its output may be PDF, HTML, EPUB, LaTeX (all of the previous with syntax highlighting), as well as ODT, DOCX, DocBook (no syntax highlighting supported so far for the last few) and a few more...
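As a rough illustration of the Markdown-plus-Pandoc toolchain described above, here is a minimal Python sketch. The helper names (`lesson_markdown`, `pandoc_command`, `build_pdf`) and the file names are made up for this example. It assumes `pandoc` is on your PATH and that a LaTeX engine is installed, which Pandoc needs by default for PDF output.

```python
import subprocess
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence, built here so this listing stays readable


def lesson_markdown(title: str, language: str, code: str) -> str:
    """Return a Markdown document with one fenced, language-tagged code block."""
    return f"# {title}\n\n{FENCE}{language}\n{code.rstrip()}\n{FENCE}\n"


def pandoc_command(source: Path, target: Path) -> list[str]:
    """Build the Pandoc call; the output format is inferred from the target extension."""
    return ["pandoc", str(source), "-o", str(target)]


def build_pdf(source: Path, target: Path) -> None:
    """Run Pandoc (assumption: pandoc and a LaTeX engine are installed)."""
    subprocess.run(pandoc_command(source, target), check=True)
```

To try it, write the result of `lesson_markdown(...)` to `lesson.md` and call `build_pdf(Path("lesson.md"), Path("lesson.pdf"))`. The language tag on the fence is what tells Pandoc which highlighter to apply.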
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Is there any consistent way to extract tables from PDF files? Any tools?
What I have done so far:
I have tried out pdftotext tool. It has an option to convert to HTML layout.
What is the problem with this:
The table information is not preserved in HTML output
I expected <table> tags, but everything was under <p> tags.
Will there be any markers in a PDF document to indicate table structures? Like <table>, <tr> and <td> in HTML?
If "yes", any pointers to this would be helpful. If "no", a definite info about this fact is also helpful.
What you could do, however, is run pdftotext -layout input.pdf output.txt.
It writes the PDF's text to a file while preserving the original layout. There are no tags, but with a bit of nifty scripting (perl / php / whatever), you can recover the data from the tables.
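The "nifty scripting" could look something like the following Python sketch. It assumes that cells in the layout-preserved text are separated by at least two spaces and that lines with fewer than two cells are not table rows. Both are heuristics of this example and will need tuning per document. The function names are made up for illustration.

```python
import re

# Assumption: two or more whitespace characters mark a column boundary.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_row(line: str) -> list[str]:
    """Split one layout-preserved line into cells on wide gaps."""
    return COLUMN_GAP.split(line.strip())


def parse_table(text: str, min_columns: int = 2) -> list[list[str]]:
    """Keep only lines that split into at least min_columns cells."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = split_row(line)
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows
```

Feed it the contents of output.txt. Single spaces inside a cell (such as "Widget A") are preserved, but cells that happen to be separated by only one space will be merged, which is one source of the inconsistency this approach has to live with.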
If you're working on a single page, you're probably better off doing it manually, but if you (like me) have to work on 100's or 1000's of pages, it's about the best you can get.
I've been looking around for a long time and can't find any better pdf-2-text tool than pdftotext.
There is a bit of inconsistency in the output, not all similar pdf tables produce a similar looking txt output, but that makes your scripting a little more interesting.
If the PDF document lacks the information that marks content as table, row, cell, etc. (known as tags), then there is no consistent way to extract tables from it. Most PDF documents do not contain these tags. Tags typically serve to make a PDF accessible so that it can, for example, be read aloud. They are not required for a PDF to be valid.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 6 years ago.
I have a PDF with the following text:
Localização
When I copy this text and paste, it gives me:
localizac¸ ˜ao
Any help is appreciated.
Thanks
For computer generated documents (not OCRd/scanned)
Some systems, like LaTeX, generate composed characters because the font in use doesn't contain (or support) the glyph in the current encoding. As a consequence, such characters are generated on the fly using composed glyphs.
Making two glyphs look like one:
A + ´ -> Á
Because of this 'trick', the selectable PDF Text Information contains the two separated glyphs. But graphically they are both rendered at the same spot.
The quick solution:
Luckily, the generated character pairs do not occur naturally in a well-written paragraph (probably in any language), so it is quite safe to just search and replace them using a case-sensitive method. You can do it manually with your favorite text editor, or with a Python script, etc. Automated or not, the principle of the solution is the same.
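A minimal Python sketch of that search/replace, under some assumptions: the pasted text uses the spacing accent characters listed in the two tables below, the cedilla follows its base letter, and the other accents precede theirs (as in the pasted sample above). The exact code points and ordering can differ between PDFs, so check your own paste first. The function name `fix_composed` is made up for this example.

```python
import re
import unicodedata

# Assumption: accents that appear BEFORE the base letter in the pasted text.
PRECEDING = {
    "\u00b4": "\u0301",  # acute
    "\u02dc": "\u0303",  # tilde
    "\u02c6": "\u0302",  # circumflex
    "\u00a8": "\u0308",  # dieresis
}
# Assumption: accents that appear AFTER the base letter.
FOLLOWING = {"\u00b8": "\u0327"}  # cedilla

_after = re.compile("([A-Za-z])([" + "".join(FOLLOWING) + "]) ?")
_before = re.compile("([" + "".join(PRECEDING) + "]) ?([A-Za-z])")


def fix_composed(text: str) -> str:
    """Rejoin split accent/letter pairs and normalize to single characters."""
    text = _after.sub(lambda m: m.group(1) + FOLLOWING[m.group(2)], text)
    text = _before.sub(lambda m: m.group(2) + PRECEDING[m.group(1)], text)
    return unicodedata.normalize("NFC", text)
```

The idea is to turn each spacing accent into the matching combining mark placed after its letter, then let Unicode NFC normalization merge the pair into one precomposed character.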
It is important to know how you are copying the text. If you are merely using a text editor and altering the underlying PDF code, you are going to have problems. PDF files are organized in a very complicated and non-human-readable way that require specialized programs to alter successfully. If you want to make this change, you will need to use a PDF editor to either edit the document, or generate a new document from scratch.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
Is there a free way to read PDF files through VBA to extract basic text content? I need to automate a weekly data acquisition process at my company where data is contained in PDF files (which are updated weekly by the data provider). Also, is there a reference I can look into to understand the file structure (DOM?) of a PDF?
Adobe's PDF reference is online here: http://www.adobe.com/devnet/pdf/pdf_reference.html
I'm not sure about the best way to read PDFs from VBA directly, but if you can call an external Java or C# program, then I would recommend using iText for basic text extraction.
EDIT: I should maybe mention that Adobe's PDF reference is an 800 page beast. I found that it's good for looking up answers to particular questions (eg, storing widths of embedded truetype fonts), but it may not be a good place to start. For that, reading through the iText book helped me to get started on the format.
The iText book contains lots of worked examples for general PDF tasks and lots of background info to help you understand PDF files. It more than pays for itself very quickly!
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
We have a bunch of documents in our organization that were inadvertently saved as Adobe PDF packages (also known as PDF 1.7 "collections").
We would like to convert these to normal PDFs (most of these "packages" contain one bog-standard pdf file), but given the number of files, it's not possible manually.
Any Adobe expert know whether:
There is an open-source or free library that handles PDF package format that I can write a script around?
Does Adobe Pro 9 have a relevant scriptable interface that would allow me to extract the relevant file from each package?
Alternatively, I'm looking at a macro-based approach, but I'd rather not go this route until investigating other options.
Thanks!
After a bunch of digging around, I found pdftk, which is distributed as source and binary on many platforms.
It does almost all of what we need to do, and we can now iterate through our documents and recursively call pdftk on each (some are multi-level attachment chains).
Note pdftk will only burst pages of the visible document into individual documents. The hidden documents remain hidden.
The option you need to use is unpack_files.
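The iterate-and-recurse approach described above could be sketched in Python roughly as follows. The function names, the `_files` directory suffix, and the depth limit are made up for this example. It assumes `pdftk` is on your PATH and that a non-zero exit code can be treated as "nothing to unpack", which you should verify against your own files.

```python
import subprocess
from pathlib import Path


def unpack_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build the pdftk call that extracts a PDF's attachments into out_dir."""
    return ["pdftk", str(pdf), "unpack_files", "output", str(out_dir)]


def unpack_recursive(pdf: Path, out_dir: Path, depth: int = 0, max_depth: int = 5) -> None:
    """Unpack pdf, then unpack any PDFs found among its attachments."""
    out_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(unpack_command(pdf, out_dir))
    # Assumption: a failed call means this file has nothing usable to unpack.
    if result.returncode != 0 or depth >= max_depth:
        return
    for attached in sorted(out_dir.glob("*.pdf")):
        # Hypothetical naming scheme: one sub-directory per nested package.
        unpack_recursive(attached, out_dir / (attached.stem + "_files"), depth + 1, max_depth)
```

Calling `unpack_recursive(Path("package.pdf"), Path("unpacked"))` would walk a multi-level attachment chain. The depth limit is only a guard against unexpectedly deep nesting.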
Yet another unwanted obfuscation format to hinder interoperability therefore classified as malware.
Using Adobe Acrobat Professional, combine all into one PDF and then split by bookmark level.
I understand this thread is a few years old, but if anyone is looking for a free utility to extract files from PDF packages (especially from large collections), check out ByteScout PDF Multitool: it was tested against 500+ MB package files to extract hundreds of multilevel chained attachments.
Disclaimer: I'm affiliated with ByteScout
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 5 years ago.
At work we use Sandcastle for creation of help files. I have been using SandCastleGUI for some time and I'm looking for a way to create additional pages in the help file.
These pages are written in XML format called MAML.
The only problem is that I couldn't find any decent editor for this file format.
I'm looking for a WYSIWYG editor to create & edit additional documentation pages.
You could use a generic XML editor with WYSIWYG support, like Oxygen or Serna. You would need an XML Schema or DTD for MAML; I assume there is one somewhere in an SDK or similar. Probably the harder part is that you would need a stylesheet that renders the XML to a display format that the editor can use to provide a WYSIWYG view of the document.
It works rather well for standard XML formats such as Docbook, but I don't know how easy it is to find/create the needed stylesheets for MAML. But generally there is no reason why it couldn't be done.
Finally I found a solution: the good people of Sandcastle Help File Builder have included an HTML to MAML converter.
There are many good HTML editors out there, and now I can use one of them and then convert the result to MAML.
Don't know if you are still looking for a solution to this, but I've been looking at help editors and ran across a codeproject article that might be useful. The article can be found at http://www.codeproject.com/KB/dotnet/DocMounter_2_Sandcastle.aspx. It features an editor that might be just what you need.