Add comments in PDF [closed]

I want to add some text to a PDF document from LaTeX. The text is not supposed to be visible in the rendered PDF; I want it to be more like a comment in code, so that I can load the "code" in a program and read the comments. Is this possible?
Kind regards

I don't know LaTeX well enough to comment on that part of your question, but there are a number of different ways information can be stored inside PDF files that would satisfy your question.
Images in PDF files are typically objects (Image XObjects, to be exact); these have a dictionary where additional information can be stored next to the image data.
PDF supports the concept of object metadata, where XMP metadata can be embedded in a PDF file for a specific object. This is a second (and better) way to embed additional non-visible information in the PDF file.
And perhaps best of all, if you can generate it from LaTeX: PDF allows object properties, which use marked-content operators in the page stream to delineate a number of objects and then associate information with that marked content.
All of those should be easy to find in the PDF specification on the Adobe website; what would remain is to figure out what means LaTeX gives you to generate any of this, and what you'd have to do to read it in your program :-)

There are two different ways:
You can comment out single lines by adding a % in front of them:
% This text will be a comment
Or you can comment out larger sections like this:
\usepackage{comment}
\begin{comment}
This text will be commented out.
\end{comment}
Note that both kinds of comment exist only in the .tex source; they are discarded during compilation and do not end up anywhere in the generated PDF.
Hope this helps!

Related

How to make InDesign's epub file vs. PDF file compatible? [closed]

We use Adobe InDesign to design storybooks, and we need both a PDF file and an ePub file. We view the PDF throughout the process and the final product looks fine as a PDF, but when we export to ePub the file is huge and the original design is all messed up. What can we do?
Why did this happen?
I've worked on ONE project going from InDesign to ePub, about two years ago, and you are right: it is a mess. InDesign didn't understand which local overrides to keep, and practically every paragraph had style="localoverride1 localoverride2 substyle3 etc" in it. It was a mess to sort and clean up.
After that miserable experience we've found that it is better to view PDF and ePub as two separate products. Our workflow takes source XML and goes EITHER into InDesign OR through an XSLT to make an ePub. We no longer use InDesign to attempt to make ePubs - with an XSLT there is a LOT more control over the look and feel of the final product.
However if you are dead set on using InDesign - I've heard that it does fixed layout "epub" fairly well (basically it ends up being a bunch of images - it's not reflowable).

Extract table data from PDF [closed]

Is there any consistent way to extract tables from PDF files? Any tools?
What I have done so far:
I have tried out pdftotext tool. It has an option to convert to HTML layout.
What is the problem with this:
The table information is not preserved in HTML output
I expected <table> tags, but everything was under <p> tags.
Will there be any markers in a PDF document to indicate table structures? Like <table>, <tr> and <td> in HTML?
If "yes", any pointers to this would be helpful. If "no", a definite info about this fact is also helpful.
What you can do, however, is use pdftotext -layout input.pdf output.txt.
It writes the PDF's text to a file while preserving the original layout. There are no tags, but with a bit of nifty scripting (Perl / PHP / whatever), you can recover the data from the tables.
If you're working on a single page, you're probably better off doing it manually, but if you (like me) have to work on hundreds or thousands of pages, it's about the best you can get.
I've been looking around for a long time and can't find any better PDF-to-text tool than pdftotext.
There is a bit of inconsistency in the output; not all similar PDF tables produce a similar-looking text output, but that makes your scripting a little more interesting.
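As an illustration of that scripting step, here is a minimal sketch (with made-up sample data) that recovers rows and cells from pdftotext -layout output, assuming cells within a row are separated by runs of two or more spaces:

```python
import re

def parse_layout_table(text: str) -> list[list[str]]:
    """Split fixed-layout text into rows of cells.

    Assumes cells are separated by two or more spaces, which is how
    pdftotext -layout usually renders column gutters. Real output
    needs per-document tuning (headers, wrapped cells, blank columns).
    """
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r" {2,}", line.strip()))
    return rows

sample = (
    "Name          Qty    Price\n"
    "Widget          3     9.99\n"
    "Gadget         12     4.50\n"
)
for row in parse_layout_table(sample):
    print(row)
```

The first printed row is the header, so downstream code can zip it against the remaining rows to build dictionaries per record.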
If the PDF document is missing the information that marks content as table, row, cell, etc. (known as tags), then there is no consistent way to extract tables from it, and most PDF documents do not contain these tags. Tags mainly serve to make a PDF accessible, so that it can, for example, be read aloud; they are not required for a PDF to be valid.
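A quick way to check whether a given PDF at least claims to be tagged is to look for /Marked true in its /MarkInfo dictionary. The sketch below is a byte-level heuristic, not a parser:

```python
import re

def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: a tagged PDF declares << /Marked true >> under /MarkInfo.

    Misses files whose document catalog sits inside a compressed
    object stream, so a False result is not conclusive.
    """
    return re.search(rb"/MarkInfo\s*<<[^>]*/Marked\s+true", pdf_bytes) is not None

print(looks_tagged(b"<< /Type /Catalog /MarkInfo << /Marked true >> >>"))  # True
print(looks_tagged(b"<< /Type /Catalog >>"))                               # False
```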

Copy and Paste PDF text gives wrong text [closed]

I have a PDF with the following text:
Localização
When I copy this text and paste, it gives me:
localizac¸ ˜ao
Any help is appreciated
Tks
For computer-generated documents (not OCRed/scanned):
Some systems like LaTeX generate composed characters when the font doesn't contain (or support) the needed glyph in the current encoding. As a consequence, such characters are generated on the fly from composed glyphs,
making two glyphs look like one:
A + ´ -> Á
Because of this trick, the selectable text information in the PDF contains the two separate glyphs, but graphically they are both rendered at the same spot.
The quick solution:
Luckily, the generated character pairs do not occur naturally in well-written text (probably in any language), so it is quite safe to just search/replace them using a case-sensitive method. You can do it manually with your favorite text editor, or with a Python script, etc. Automated or not, the principle of the solution is the same.
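Here is what that search/replace could look like in Python. The pair table covers only the two pairs visible in the question's own pasted sample ('c' plus a spacing cedilla, and a stray space plus spacing tilde plus 'a') and would need to be extended for a real document:

```python
# Map broken glyph pairs (as they come out of copy/paste) to the
# precomposed characters they were meant to represent.
PAIRS = {
    "c\u00b8": "ç",    # 'c' followed by a spacing cedilla (¸)
    " \u02dca": "ã",   # stray space + spacing tilde (˜) + 'a'
}

def fix_composed_pairs(text: str) -> str:
    """Apply every known pair replacement, case-sensitively."""
    for broken, fixed in PAIRS.items():
        text = text.replace(broken, fixed)
    return text

print(fix_composed_pairs("localizac\u00b8 \u02dcao"))  # localização
```

Building the table is the manual part: paste a few affected words, look at their code points with `[hex(ord(c)) for c in word]`, and add one entry per broken pair you find.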
It is also important to know how you are copying the text. If you are merely using a text editor and altering the underlying PDF code, you are going to have problems: PDF files are organized in a complicated, non-human-readable way that requires specialized programs to alter successfully. If you want to make this change, you will need a PDF editor to either edit the document or generate a new document from scratch.

Analyze format of pdf-file and extract text and images [closed]

I need to extract the "articles" from this magazine, which has both text and images. The image content has to be extracted separately, and the text extracted (as far as possible) and placed separately.
How do I go about doing this? Is there a commercial service / API that does this already? The input to the program/service will just be the file.
Example input: http://edition.pagesuite-professional.co.uk/pdfspool/rQBvRbttuPUWUoJlU6dBVSRnIlE=.pdf
(the actual file would be a normal PDF file, not a secured one)
Docotic.Pdf library can extract images and text from PDF files for you.
Here are a couple of samples for your task:
Extract text from PDFs
Extract images from a PDF
Extracted images can be saved as JPEGs or TIFFs. You can extract text from each page or from the whole document, and you can extract text chunks with their coordinates.
Disclaimer: I work for Bit Miracle, vendor of the library.
Try this one:
http://asp.syncfusion.com/sfaspnetsamplebrowser/9.1.0.20/Web/Pdf.Web/samples/4.0/Importing/TextExtraction/CS/Default.aspx?args=7
The same component also has an image-extraction feature.
You could give it a try!
If you can afford a commercial option, Amyuni PDF Creator will let you enumerate all components inside the PDF file (text, images, etc.), extract them as independent objects, and create new PDF files from them.
You may use Aspose.Pdf.Kit to extract text and images separately from a PDF file. The API is quite simple. You can also find samples, tutorials and support on Aspose website.
Note: I'm working as Developer Evangelist at Aspose.
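If a commercial library is not an option, the image side can sometimes be handled with a naive scan of the raw file: JPEG-compressed images (the /DCTDecode filter) are stored verbatim in the PDF, so the bytes between stream and endstream form a complete .jpg file. This is a rough sketch, not a parser, and it ignores every other image filter:

```python
import re

def extract_jpeg_streams(pdf_bytes: bytes) -> list[bytes]:
    """Collect raw JPEG data from /DCTDecode image streams.

    Misses images using other filters (FlateDecode, JPXDecode, ...)
    and anything stored inside compressed object streams.
    """
    pattern = rb"/DCTDecode.*?stream\r?\n(.*?)endstream"
    return [m.group(1).rstrip(b"\r\n") for m in re.finditer(pattern, pdf_bytes, re.S)]

# Synthetic fragment standing in for a real PDF's image object.
fragment = (b"5 0 obj\n<< /Subtype /Image /Filter /DCTDecode /Length 10 >>\n"
            b"stream\n\xff\xd8\xff\xe0JPEG\xff\xd9\nendstream\nendobj\n")
for jpeg in extract_jpeg_streams(fragment):
    print(jpeg[:2], jpeg[-2:])  # JPEG SOI/EOI markers: b'\xff\xd8' b'\xff\xd9'
```

Each returned byte string can be written straight to an `.jpg` file; text extraction is much harder to do by hand and is where the libraries above earn their keep.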

get latex code out of a PDF file [closed]

I have a CV in PDF format which needs to be converted to LaTeX. Is there a way to 'reverse engineer' the PDF so that I can get the LaTeX code?
Short answer: No
Slightly longer answer:
You may get the plain text back, but you can't restore the original LaTeX source.
You may be able to import the PDF into a word processor and export LaTeX from it (either AbiWord or KOffice could do that, if I remember correctly), but the result will not be pretty. This won't get you the original LaTeX, only a very poor approximation. I think recreating the CV from scratch in LaTeX will be easier.
No. An explanation can be found here:
The job just can’t be done automatically: DVI, PostScript and PDF are “final” formats, supposedly not susceptible to further editing — information about where things came from has been discarded. So if you’ve lost your (La)TeX source (or never had the source of a document you need to work on) you’ve a serious job on your hands. In many circumstances, the best strategy is to retype the whole document, but this strategy is to be tempered by consideration of the size of the document and the potential typists’ skills.
Just as you can automatically reverse engineer C code from a compiled exe (though the result is not very readable and has certain limitations), you should in principle be able to reverse engineer LaTeX code from a compiled PDF. There just don't seem to be any tools that even attempt this. It would certainly be an interesting thing to implement.
There's some research going on in that area:
http://www.fi.muni.cz/~sojka/dml-2011-baker-sexton-sorge.pdf
The LaTeX file will have been compiled to PDF, converting its contents into low-level drawing commands (an imaging model descended from PostScript); the source-level structure is discarded in the process.