Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 6 years ago.
I have a PDF with the following text:
Localização
When I copy this text and paste, it gives me:
localizac¸ ˜ao
Any help is appreciated.
Thanks
For computer generated documents (not OCRd/scanned)
Some systems, such as LaTeX, generate composed characters because the font does not contain (or support) the required glyph in the current encoding. As a consequence, such characters are built on the fly from composed glyphs.
Making two glyphs look like one:
A + ´ -> Á
Because of this 'trick', the selectable text information in the PDF contains the two separate glyphs, even though graphically they are both rendered at the same spot.
The quick solution:
Luckily, the generated character pairs do not occur naturally in a well-written paragraph (probably in any language), so it is quite safe to just search for and replace them using a case-sensitive method. You can do it manually with your favorite text editor, with a Python script, etc. Automated or not, the principle of the solution is the same.
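The search/replace idea can be sketched in Python roughly as follows. This is only a minimal sketch under assumptions taken from the pasted sample in the question: the cedilla shows up after its letter, the other accents show up before their letter, and a stray space may sit between two accent marks. The function name `recompose` and the accent tables are made up for illustration; other PDFs may order the glyphs differently.

```python
import re
import unicodedata

# Assumption: these spacing accents appear BEFORE the base letter.
# Each one is mapped to the matching Unicode combining mark.
PREFIX = {
    "\u02dc": "\u0303",  # small tilde
    "\u00b4": "\u0301",  # acute accent
    "\u02c6": "\u0302",  # circumflex
    "\u00a8": "\u0308",  # diaeresis
}
# Assumption: the cedilla appears AFTER the base letter.
SUFFIX = {"\u00b8": "\u0327"}

PREFIX_CLASS = "".join(PREFIX)
SUFFIX_CLASS = "".join(SUFFIX)


def recompose(text: str) -> str:
    """Merge 'letter + loose accent' pairs back into single characters."""
    # Drop a space that is sandwiched between two accent marks ("c¸ ˜a").
    text = re.sub(f"(?<=[{SUFFIX_CLASS}]) (?=[{PREFIX_CLASS}])", "", text)
    # letter + trailing accent -> letter + combining mark
    text = re.sub(
        f"([A-Za-z])([{SUFFIX_CLASS}])",
        lambda m: m.group(1) + SUFFIX[m.group(2)],
        text,
    )
    # leading accent + letter -> letter + combining mark
    text = re.sub(
        f"([{PREFIX_CLASS}])([A-Za-z])",
        lambda m: m.group(2) + PREFIX[m.group(1)],
        text,
    )
    # NFC turns "letter + combining mark" into the precomposed character.
    return unicodedata.normalize("NFC", text)


print(recompose("localizac\u00b8 \u02dcao"))  # localização
```

Text that contains none of these accent marks passes through unchanged, which is what makes the replacement reasonably safe to run over a whole document.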
It is important to know how you are copying the text. If you are merely using a text editor and altering the underlying PDF code, you are going to have problems. PDF files are organized in a very complicated, non-human-readable way that requires specialized programs to alter successfully. If you want to make this change, you will need to use a PDF editor to edit the document, or generate a new document from scratch.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
I'm looking for a tool to display code examples in PDF file. I mean that I would like to colorize and indent code in my PDF (it's for lessons).
I'm not able to find anything on the web or on StackOverflow. It's full of tutorials to use code to make PDF but not to display code in PDF. When I search for 'display' it gives me how to display PDF in web/applications.
Sorry to disappoint you, but:
There is no such thing as you are looking for!
If you want code samples on a PDF page to be syntax highlighted, you must look for a tool that does this within the source document from which the PDF file is generated.
There is no tool in the world, neither Free and Open Source Software, nor commercial payware, that lets you edit a PDF and convert the source samples on its pages into properly syntax highlighted parts. (The only thing you can possibly do on this level is adding specific comments -- here you have to manually highlight specific words or sentences with a background color of your choice.)
If you are looking for a toolchain that makes it easy to generate PDFs from scratch containing syntax-highlighted code samples, look at:
Markdown: a very lean text markup language to write the document in (use any text editor you like)
Pandoc: a powerful Markdown-to-anything converter. It's a command line tool available for all major OS platforms. Its output may be PDF, HTML, EPUB, LaTeX (all of the previous with syntax highlighting), as well as ODT, DOCX, DocBook (no syntax highlighting supported so far for the last few) and a few more...
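As a rough sketch of that toolchain, a small Python wrapper could build and run the Pandoc command. The file names `lesson.md` and `lesson.pdf` and the helper names are made up for illustration; it assumes Pandoc is on the PATH, that a LaTeX engine is installed for PDF output, and that the code samples in the Markdown file are fenced blocks tagged with their language.

```python
import shutil
import subprocess


def pandoc_command(source_md, output_pdf, style="tango"):
    """Build a pandoc command that renders Markdown to PDF with highlighted code."""
    # --highlight-style selects the color theme used for fenced code blocks.
    return ["pandoc", source_md, "-o", output_pdf, f"--highlight-style={style}"]


def convert(source_md, output_pdf, style="tango"):
    """Run pandoc; raises if pandoc is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source_md, output_pdf, style), check=True)


# Show the command that would be run for a hypothetical lesson file.
print(" ".join(pandoc_command("lesson.md", "lesson.pdf")))
```

Calling `convert("lesson.md", "lesson.pdf")` would then produce the PDF; running `pandoc --list-highlight-styles` shows which theme names your installed version accepts.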
Closed 8 years ago.
We used Adobe InDesign to design story books. We need both a PDF file and an EPUB file. Since we all review the work as PDF during the process, the final PDF is clean, but when we export it as an EPUB file, the file is huge and the original design is completely messed up. What can we do?
Why did it happen?
I've worked on ONE project going from InDesign to ePub about two years ago - and you are right it is a mess. It didn't understand which local overrides to keep and practically every paragraph had style="localoverride1 localoverride2 substyle3 etc" in it. It was a mess to sort and clean up.
After that miserable experience we've found that it is better to view PDF and ePub as two separate products. Our workflow takes source XML and goes EITHER into InDesign OR through an XSLT to make an ePub. We no longer use InDesign to attempt to make ePubs - with an XSLT there is a LOT more control over the look and feel of the final product.
However if you are dead set on using InDesign - I've heard that it does fixed layout "epub" fairly well (basically it ends up being a bunch of images - it's not reflowable).
Closed 6 years ago.
I have been trying the whole day to convert several .pdf files, which contain traffic flow data for São Paulo, to spreadsheets for MS Office Excel or for LibreOffice Calc on Ubuntu. When I open a .pdf file with LibreOffice Calc, it opens in LibreOffice Draw instead, and I can't get the spreadsheet.
The most promising method that I found was here, with pdftotext. It works fine and I can get the tables into LibreOffice Calc, but I have to adjust the columns manually.
My problem is that I have so many .pdf files that it would take me a lot of time.
Does anyone know a better method?
Another option is to use Okular (http://okular.kde.org).
It has a table selection tool (Ctrl+5).
You may select a table, add lines for additional rows and columns and copy the resulting table into a clipboard.
It works fine for me.
Tabula can work quite well. PDF is not an easy format to extract structured information from, so it's not always possible.
Maybe the -layout option would be useful for you. With this option set, pdftotext will try to keep the column layout in the resulting text file.
Now you can import the text file into LibreOffice Calc with the appropriate import settings. When opening a txt file in Calc, you will be asked how to parse the file content. Under Separator Options, select "Separated by" with "Space", and also enable "Merge delimiters". This way, Calc will be able to restore the column structure (assuming the cell data doesn't contain spaces).
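Since there are many files, the two steps above could be batched with a short script. The following is only a sketch: the function names and the sample row are made up for illustration, it assumes `pdftotext` is on the PATH, and it splits columns on runs of two or more spaces (rather than on every space as in the Calc settings above) so that cells containing a single space survive.

```python
import csv
import re
import subprocess
from pathlib import Path


def layout_text_to_rows(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


def convert_pdf(pdf_path):
    """Run `pdftotext -layout` on one PDF and write a CSV file next to it."""
    pdf_path = Path(pdf_path)
    # The trailing "-" tells pdftotext to write the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    csv_path = pdf_path.with_suffix(".csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(layout_text_to_rows(result.stdout))
    return csv_path


def convert_folder(folder):
    """Convert every .pdf file in a folder; returns the CSV paths written."""
    return [convert_pdf(p) for p in sorted(Path(folder).glob("*.pdf"))]


# Invented sample of what -layout output might look like.
sample = "Street          Hour    Vehicles\nAv. Paulista    08:00   1520\n"
print(layout_text_to_rows(sample))
```

Calling `convert_folder(".")` would process every PDF in the current directory. The resulting CSV files open directly in Calc or Excel, though tables with ragged or merged cells will likely still need manual cleanup.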
A tool called Able2Extract is another option that can do exactly what you want with minimal errors.
Questions must demonstrate a minimal understanding of the problem being solved. Tell us what you've tried to do, why it didn't work, and how it should work. See also: Stack Overflow question checklist
Closed 9 years ago.
I want to add some text to a PDF document from LaTeX. The text is not supposed to be visible in the actual PDF; I want it to be more like a comment in code, so that I can load the file in a program and read the comments. Is this possible?
Kind regards
I don't know LaTeX well enough to comment on that part of your question, but there are a number of different ways information can be stored inside PDF files that would satisfy your requirement.
Images in PDF files are typically objects (Image XObjects to be exact) - these have a dictionary where additional information could be stored next to the image data.
PDF supports the concept of object metadata where XMP metadata can be embedded in a PDF file for a specific object. This would be a second way to embed additional non-visible information in the PDF file (and a better one).
And perhaps best of all, if you can generate this from LaTeX, is the fact that PDF allows object properties, which use marked content operators in the page stream to delineate a number of objects and then allow information to be associated with that marked content.
All of those should be easy to find in the PDF specification on the Adobe website; what would remain would be to figure out what ways you have in LaTeX to generate any of this and what you'd have to do to read them in your program :-)
There are two different ways:
You can either comment out single lines by adding a % in front of them
% This text will be a comment
Or you can comment out larger sections with the comment package (the \usepackage line goes in the preamble):
\usepackage{comment}
\begin{comment}
This text will be commented out.
\end{comment}
Hope this helps!
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 9 years ago.
I have a CV in PDF format which needs to be converted to LaTeX code. Is there a way to 'reverse engineer' the PDF so that I can get the LaTeX code?
Short answer: No
Slightly longer answer:
You may get the plain text back, but you can't restore the original LaTeX source.
You may be able to import the PDF into a word processor and export LaTeX from it (either AbiWord or KOffice can do that, if I remember correctly), but the result will not be pretty. This won't get you the original LaTeX, only a very poor approximation. I think recreating the CV from scratch in LaTeX will be easier.
No. An explanation can be found here:
The job just can’t be done automatically: DVI, PostScript and PDF are
“final” formats, supposedly not susceptible to further editing —
information about where things came from has been discarded. So if
you’ve lost your (La)TeX source (or never had the source of a document
you need to work on) you’ve a serious job on your hands. In many
circumstances, the best strategy is to retype the whole document, but
this strategy is to be tempered by consideration of the size of the
document and the potential typists’ skills.
Just like you can automatically reverse engineer C code (though not very readable and with certain limitations) from a compiled exe you should be able to reverse engineer the LaTeX code from a compiled PDF. There just don't seem to be any tools around that even attempt this. This would sure be an interesting thing to implement.
There's some research going on in that area:
http://www.fi.muni.cz/~sojka/dml-2011-baker-sexton-sorge.pdf
The LaTeX file will have been printed to PDF, converting the contents into PostScript commands.