How can I create PDF documentation with cut-and-paste-able code snippets? [closed] - pdf

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
I've tried using LaTeX and DocBook for documenting programming tools, to get PDF output. What I've found is that these tools are excellent in some ways - easily versioned, and generating very usable PDF manuals. But there is a serious flaw. Code-snippets cannot simply be cut-and-pasted out of the PDF.
With DocBook, the problem is the loss of whitespace - mostly for indentation, but any repeated spaces seem to get stripped out. So, once you paste the snippet into a text editor, you'll need to clean up the indentation and vertical alignment. Not too much hassle for two or three lines, but it quickly gets annoying.
With LaTeX - well, it's a mess. The following was taken from a PDF generated using the LaTeX in MikTeX 2.8.
node myclas s
f f i e l d f i e l d 0 1 : i n t ;
f i e l d f i e l d 0 2 : ” char ” ;
g;
The intended example is...
node myclass
{
field field01 : int;
field field02 : "char*";
};
Other than the fact LaTeX plays with the quotes, the intended form is what you see in Adobe Reader - but not much like what you get from a cut-and-paste. Don't ask me what's going on with the spaces, or why the braces turned into letters, or what happened to the asterisk - I don't know!
Mostly, I've noticed these things playing with ways of keeping my own personal notes, and just went back to other ways. Some notes are in HTML or plain text, so I can version them. Others are in an old Journal program I've used for years. But I've written a tool that I may want to release soon - and I'll want to include a usable PDF manual, which will need to include examples.
So - is there a way of creating PDF documentation where the code snippets can be easily cut-and-pasted? Preferably a way that allows me to keep "sources" in versioned text files.
EDIT
Any solution must be portable. I will need to use it on Linux and on Windows XP.
EDIT
It looks like this may be impossible.
I've tried printing from Notepad++ to the Adobe Acrobat Pro 7 printer driver. The resulting document looked fine, but cutting and pasting gave the same missing whitespace problems as occur with DocBook.
I tried using the touchup text tool in Acrobat Pro to add leading spaces. These are preserved when you save and reload - but when you select text normally in acrobat, they aren't included. You can only cut-and-paste including those spaces using the touchup text tool, so far as I can tell, which is obviously not included in reader.
In other words, this looks like a fundamental limitation - not of the PDF format itself so much as the tools that work with it. There appears to be a general assumption at work here that whitespace is insignificant - which for my purposes obviously isn't true.
EDIT
One solution may be a "text field". I can add these fairly easily using Acrobat Pro, can set a fixed width font, enter multiple lines of text and make the field read only. In Acrobat Pro 7, the text in the field then isn't selectable - but in Reader 9 it is selectable and everything is preserved when you cut and paste.
The question is - can text fields be generated directly using some kind of markup language that is usable to create complete manuals?

I'd suggest enscript. I use it for producing archives and documentations.
Also, you can merge multiple source codes ps'ed with enscript into another pdf.

If your code is kept in external files, one way would be to attach the original file(s) as PDF attachments. This could be done with Docbook, LaTeX, DITA, and a few others.
For example, if you are using this method to include code in Docbook, you can write some code to your XSL customization layer for adding the external file as an attachment to the PDF. As far as I know, this is portable (although I haven't personally tried to open PDF files with attachments in Evince, Okular, Xpdf, etc to see what happens).
If you are processing the Docbook files using even FOP, you should still be able to write something into your customization layer to attach files. See the section on PDF attachments. You could even output a link to the attachment below the codeblock in the PDF if you want to make it more discoverable to people.
A similar solution should be possible using LaTeX with the attachfile package.

Related

How to create business ready reports from jupyter notebooks? [closed]

Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 4 months ago.
Improve this question
I have took quite some time to get a reasonable answer to my inquiry by myself but ran into a dead end and hope you guys can help me.
Issue:
For the purpose of business reporting, I have created some juypter notebooks which include multiple pandas tables and seaborn / matplotlib plots as code cell output with some occasional markdown cells in between to provide explanations. Now, I want these reports to be in a business-ready format to share them with stakeholders. With business-ready I intend the following requirements:
The report does not include code
Output file format: PDF
The report includes a title page with title, additional information (e.g. date of analysis) and a table of contents
Tables are in a appealing visual format that allows easy reception of information
The report is well structured
... and I am not able to get all these requirements together.
So far, I prefer to work with vscode and use the browser based juypter notebook if necessary (which unfortunately lacks some functionalities).
What I have tried:
(1) this was a no brainer, I just --no-input to the nbconvert command in the anaconda shell and whatever I do regarding the next points, it excludes the code
(2) There are two ways I could find so far, which influence all subsequent steps/requirements
Way 1 ("html detour"): I convert the .ipynb to html and print it as PDF (this is a 2-step process, thus I see it as a detour)
Way 2 ("latex conversion"): I convert it to a PDF via nbconvert --to pdf and it uses latex in the background to create a pdf
(3) ...and here start the issues:
html detour: I can get a toc via the nbextension extension for jupyter notebooks and with it, I can use either the H1 header level as title or include an extra markdown cell and increase the font size with an html command such that it looks appealing. Additional information are added manually in extra code cells. However, the toc only works in the browser version of jupyter, which results in writing the analysis in vscode, going to the browser to add the toc, converting it in the shell, open the html and print it as pdf...
latex conversion: I can set up a latex template, which is included in the nbconvert command that includes a toc by design. However, it either picks up the filename as title automatically or a title I can set in the metadata of the notebook, which I can only edit from the browser. Further, the date of conversion is added below the title automatically as well, which might note be the date of the analysis in case I have to reconvert it because someone wants a minor change or something. Thus, I cannot turn auto title and date off (at least I couldn't find an option so far) and I have multiple steps as well.
(4) This one makes eventually the difference in the usability of the report
html detour: The format in the html file itself is the quite appealing format you usually get from tables using display() command on a table in jupyter (which is used anyway if you just call a variable in juyper without print()) or if you build a table in a markdown cell. The table has a bold header and every other row has a grey background. Using pandas .style method, I can format the table in the html file very nicely with red fon color for negative values only or percentage bars as cell background. However, I loose all these formats when I print the PDF. Then its just a bold header, a bold line splitting header and body and the rows. Further, all cell output tables are left aligned in the html (and I refer to the table itself, not its content) and the markdown tables are centered, which looks strange or rather - and this is the issue - unprofessional. The benefit, however, is that these tables are somewhat auto-adjusted to a letter size format in a certain range if the table would be wider than a letter page.
latex conversion: By design, the tables are not converted. I have to use pandas.set_option(display.large_repr, True) to convert all subsequent pandas table output or add .to_latex()to every single pandas table. This has several downfalls. Using this, all tables are displayed as the code that would be required to build a table in latex and while doing the analysis, this is often harder to interpret... especially if you want to find errors. Adding it only when the analysis is done, creates just unnecessary iterations. Further, I want to use the last report as template for the next and would have to delete the command, do my stuff and add it again. Wider tables taht don't fit the letter size are just cut of regardless of how much wider they are compared to the page size and I would have to check every table (last report were 20+) whether everything is included. ...and headers become longer if they include explanatory information. And finally, the latex table format eventually looks professional, but more scientifically professional and not business professional and can discourage one or another reader in my experience.
(5) So, since everything is made from cells and converted automatically, you get some strange output with headers on the end of one page and text and tables and plots on the next ...or pages with just a plot and so on...
html detour Its hard to describe the general issues I have. If you have ever printed a website, you have probably got some weird text bulk that looks unstructured with occasional half white pages where they should not be. Thats what you get, when printing the html file of a jupyter. It would help, if I could include a forced pagebreak and you can find several versions of adding pagebreaks in the cell or metadata of cells but they do not work since the html is created with a high level setting prohibiting a pagebreak. Thus, I could only go in the html code and add page breaks manually. Manuel effort I would like to avoid.
latex conversion:Well, \pagebreakworks.
So, due to the issues above, I currently tend towards the html detour but it does not make it look like an appealing report. I have tried several latex templates but was usually dissatisfied with the output since the .to_latex command makes it tedious and the report eventually looks like a scientific paper and not like a business report. The thing is, while this looks like a high standard, all these requirements are fulfilled by R-mardkown notebooks basically out of the box with slight additions to the yaml command in the top of the file. But I cannot use them for the report I want to create.
So, after this long intro (and I thank everybody for taking the time to read it), my question is how do I get appealing reports from a jupyter notebook?
Thanks!!!!!
Honestly, I'm in the same boat as you. It seems quite challenging to generate publication-ready PDF Reports natively from JupyterLab / Jupyter using nbconvert and friends.
Solution (that I'm using): What I can recommend is a different tool that will help you make amazing PDF reports. It's using RStudio's Rmarkdown (completely free) and the new ability to use Python from RStudio. I'm going to be teaching this in my R/Python Teams Course (course waitlist is up).
Report Example
Here's how I'm doing it in my course:
Step 1 - Install Rstudio IDE 1.4+ & R 4.0+
Head over to Rstudio and install their IDE. You'll also need to install R.
Step 2 - Create a Project
Step 3 - Set Python Environment of your Project
Go to Tools > Project Options. Select the Python Interpreter.
Step 4 - Begin Coding Markdown and Python
Use "Python Code Chunks".
Step 5 - Knit to PDF
Note that this requires some form of LatTex. You can install easily with this package: tinytex.
Step 6 - Check out your PDF Report
Looks pretty slick.
Try it out and see if it works for you.
I'd go like this from terminal (this is to convert to Word, but also PDF is available, just change your last output to .pdf):
jupyter nbconvert --to html notebook.ipynb --TemplateExporter.exclude_input=True && pandoc notebook.html -s -o results.docx --resource-path=img --toc
Apart from installation and other pieces there are several aspects which make usage of nbconvert for files conversion quite a tedious task .
Anyone tried out the Jupyter Executable Notebook or R markdown methods ( they are useful but there is an extra cost of time and efforts which makes it less feasible )
What i found to be very useful is there are many websites serving this purpose it quick, easy and hassle free .
I use this IPYNB TO PDF , there are others as well .

Replace words/phrases in existing PDF or docx with other words

I am trying to make a dynamic PDF generator as an .NET Core API. I want to take an existing PDF, or .docx file, and edit it so it replaces the current name (John Doe) with something that can be replaced like #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this on a Docker environment, so I can easily execute commands and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar
Problem: Only the characters used in the PDF can be used to replace. So if I only use A, B, C as characters, it will make D into Times New Roman (or default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers aaand have unlimited time
Problem 1: Only our developers can make changes, since our marketing department do not know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is what you see is what you get. I can make the whole thing in HTML, press CTRL+P and see what it looks like as a finished PDF, basically.
I am looking for a better solution, though. It can be paid. It can be free. All I need is to change out words/phrases with other words dynamically, which apparently seems like a tough thing to do.
Thanks for specifying what you've already found clearly. It helps a lot providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft to do the docx->odt conversion. It's not good as you know. Use LibreOffice itself to do this step. The rest of your process remains the same.
For some documents, Libre Office does doc->odt much better. So, you can instead work with DOC format and get a better result without any other changes.
You won't be able to remove the devs from the process, but you can certainly reduce their role allowing your business/marketing teams to have more direct input simply by:
get the starting point document to the devs to run through the conversion process. The devs can "clean up" the document to make it convert nicely.
make this version of the document the "official" starting point. The business or technical teams can load it, adjust it, and put it back into the process.
if possible, expose a test-platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and if they're good, do impressive stuff without any dev input.
the above steps simply mean don't expect perfect conversion of arbitrary complex documents. Starting from a (even complex) working baseline is great.
Some of that might show you that your #2 is actually going to get the best overall results.
I hope that helps.

Is there any GNU/Linux command line utility that converts .doc(x) files to .pdf? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
Surely, I am the 100th user who is asking this but after I have searched through similar topics here and on other websites I still cannot find what I need.
I like to have a simple command line tool for my GNU/Linux which converts .doc(x) files to .pdf BUT the output should look the same as the original.
LibreOffice doesn't seem like a good choice for this because it does not convert well in some cases. I have found a website freepdfconvert.com which does the job very well, but I cannot upload any sensitive files since it is a big risk. I don't say they would do anything bad with them but it is how it is.
If I can't find any good tool maybe I will have to write one myself.
Unfortunately there are no Linux-based guaranteed 1-to-1 convertors for Word (doc/docx) to PDF. This is because Word, a Microsoft product, uses a proprietary format that changes slightly with every release. As it was not traditionally a publicly documented format and Microsoft does not port Word/Office to Linux (nor ever will) then you must rely upon reverse engineered third party tools for older formats (doc) and proper interpretation of the Office Open XML format by third party developers.
We found the best open source solution is LibreOffice (which was forked from OpenOffice.org, which itself was called Star Office before it was open sourced). It is much more actively developed than AbiWord, as another answer suggested.
The usage from the command line is simple and well documented with plenty of examples:
soffice --headless --convert-to pdf filename.doc
Or also you can use libreoffice instead of soffice on newer versions.
There is also Pandoc.
Pandoc, mainly known for its Markdown-capable processing goodness (for outputting HTML, LaTeX, PDF, EPUB and what-not) in recent months has gained a rather well-working capability to process DOCX input files.
(NOTE: Pandoc only works for DOCX, not for DOC files.)
For its PDF output to work, it requires a working LaTeX installation (with either or all of pdflatex, lualatex and xelatex included). In this case the following simple command should work:
pandoc -o output.pdf -f docx input.docx
Note however, that the output layout and font styles now will not look at all similar to what it would look if you exported the DOCX from Word to PDF. It will be using the styles of a default LaTeX document.
You can influence the output style of the LaTeX-generated PDF by using a custom template file like this...
pandoc \
-o output.pdf \
-f docx \
--template=my-latex-template.tmplt \
input.docx
...but this is a feature more for Pandoc/LaTeX experts to use than for beginners.

Example PDF language code which helps to study the official PDF specification? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I am trying to learn the PDF file format.
To this end I downloaded Adobe's PDF specification file, which is huge.
So to help me study the details of PDF, I want to follow its abstract explanations by looking in parallel at some real-world PDF files.
For example, one idea was to create a PDF file (using LaTeX) which has only one page and as content even only one character, a.
But when I open this PDF file in a hex editor (or in other tools that can show the internal PDF structure), there is a lot of binary or compressed content inside this PDF. For an example for what I see, look at the screenshot below:
I simply can not identify which part of this binary is representing my character a in this PDF.
The same happens with all the real-world PDF files I've tried so far. I simply cannot find any PDF files which contain working example code to help me understand the generic PDF language specification.
I would like others to explain to me: is there a practical way to study the PDF specification while at the same time verifying its bits and pieces with real PDF files?
I would like to know: which software tools are commonly used by PDF programmers that would help a newbie developer like me to dissect and un-compress existing binary PDF files so their source code can be investigated using a simple text editor? (Note: I'm not asking for a recommendation. In compliance with the SO FAQ I just want to know if such tools do exist, and which names they have.)
Is there a resource of freely available PDF files which don't contain binary and/or compressed content? Or how could I create my own such example files?
Are there (preferably free) PDF editors/parsers available which can visualize + dissect the raw binary data of PDF files and expose their structure?
I only need a first hook. The entry point, if you will, to the narrow path in the thick jungle of real world PDF files, which I then could follow along... while using the help of this bushwacker called 'PDF Specification'.
The creators of iText (a Java/C# lib to create and manipulate PDFs) published a tool called RUPS.
From the sourceforge page:
RUPS is an abbreviation for Reading and Updating PDF Syntax. RUPS is a tool built on top of iText® that allows you to look inside a PDF document and browse the different PDF objects and content streams. (Updating PDFs isn't possible yet.)
The way I helped myself to learn PDF syntax was this:
Looked for a tool that could de-compress PDFs (de-compress the internal streams).
Found qpdf, Jay Birkenbilt's commandline tool described as: "does structural, content-preserving transformations on PDF files".
Routinely running qpdf --qdf input.pdf decompressed-input.pdf.
Opening the newly created decompressed-input.pdf in a text editor.
The --qdf mode of the tool transforms the binary and ASCII elements of PDFs in a very useful way, without changing their visual page appearance (and it's very fast):
Decompress previously compressed objects (exposing f.e. the PDF language source code of page element drawing operations).
Also expand object streams (ObjStrm).
Normalize the presentation of arrays, strings etc.
Re-number objects so they start from 1 0 obj and then present them in ascending order in the file.
Repair b0rken xref entries.
Add comments which contain an object's original identity in the original file.
Add comments for each page.
...and some more.
Looking at these (now mostly ASCII) files in a normal text editor is way more easy than trying to figure out the original binary PDF.
I would recommend taking a look at a few files using PDF Vole (a tool based on iText, and similar to RUPS).
PDF Vole and RUPS will both allow you to navigate through the structure of a PDF file, inspect the entries on every object, decompress compressed streams, decrypt the file when needed, look at the content of pages and annotations, and track down the relation between objects in the file.
For example this file:
Will look like this in PDF Vole:
You could also take a look on the class hierarchy of iText itself (which is almost 1-to-1 with the PDF spec) and the book that explains it, iText in Action.
If you are trying to generate PDF files via code, then this CodeProject source code might help.
The code along with the Adobe specification should get you going. I don't think there are many short cuts here. Understanding PostScript is going to take some study!
EDIT: and seeing as a PDF is compressed PostScript, something like RoPS could be handy too.

Structure of a PDF file? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
Improve this question
For a small project I have to parse pdf files and take a specific part of them (a simple chain of characters). I'd like to use python to do this and I've found several libraries that are capable of doing what I want in some ways.
But now after a few researches, I'm wondering what is the real structure of a pdf file, does anyone know if there is a spec or some explanations anywhere online? I've found a link on adobe but it seems that it's a dead link :(
Here is a link to Adobe's reference material
http://www.adobe.com/devnet/pdf/pdf_reference.html
You should know though that PDF is only about presentation, not structure. Parsing will not come easy.
I found the GNU Introduction to PDF to be helpful in understanding the structure. It includes an easily readable example PDF file that they describe in complete detail.
Other helpful links:
PDF Succinctly book is longer and has helpful pictures.
Introduction to the Insides of PDF is a presentation that isn't as in-depth but gives a quick overview and has lots of pictures.
When I first started working with PDF, I found the PDF reference very hard to navigate.
It might help you to know that the overview of the file structure is found in syntax, and what Adobe call the document structure is the object structure and not the file structure. That is also found in Syntax. The description of operators is hidden away in Appendix A - very useful for understanding what is happening in content streams. If you ever have the pain of working with colour spaces you will find that hidden in Graphics! Hopefully these pointers will help you find things more quickly than I did.
If you are using windows, pdftron CosEdit allows you to browse the object structure to understand it. There is a free demo available that allows you to examine the file but not save it.
Here's the raw reference of PDF 1.7, and here's an article describing the structure of a PDF file. If you use Vim, the pdftk plugin is a good way to explore the document in an ever-so-slightly less raw form, and the pdftk utility itself (and its GPL source) is a great way to tease documents apart.
I'm trying to do pretty much the same thing. The PDF reference is a very difficult document to read. This tutorial is a better start I think.
This may help shed a little light:
(from page 11 of PDF32000.book)
PDF syntax is best understood by considering it as four parts, as shown in Figure 1:
• Objects. A PDF document is a data structure composed from a small set of basic types of data objects.
Sub-clause 7.2, "Lexical Conventions," describes the character set used to write objects and other
syntactic elements. Sub-clause 7.3, "Objects," describes the syntax and essential properties of the objects.
Sub-clause 7.3.8, "Stream Objects," provides complete details of the most complex data type, the stream
object.
• File structure. The PDF file structure determines how objects are stored in a PDF file, how they are
accessed, and how they are updated. This structure is independent of the semantics of the objects. Sub-
clause 7.5, "File Structure," describes the file structure. Sub-clause 7.6, "Encryption," describes a file-level
mechanism for protecting a document’s contents from unauthorized access.
• Document structure. The PDF document structure specifies how the basic object types are used to
represent components of a PDF document: pages, fonts, annotations, and so forth. Sub-clause 7.7,
"Document Structure," describes the overall document structure; later clauses address the detailed
semantics of the components.
• Content streams. A PDF content stream contains a sequence of instructions describing the appearance of
a page or other graphical entity. These instructions, while also represented as objects, are conceptually
distinct from the objects that represent the document structure and are described separately. Sub-clause
7.8, "Content Streams and Resources," discusses PDF content streams and their associated resources.
Looks like navigating a PDF file will require a little more than a passing effort.
If You want to parse PDF using Python please have a look at PDFMINER. This is the best library to parse PDF files till date.
Didier have a tool to parse the PDF:
http://didierstevens.com/files/software/pdf-parser_V0_4_3.zip
or here:
http://blog.didierstevens.com/programs/pdf-tools/ which cataloged several related pdf-analysis tools.
Another tool is here:
http://mshahzadlatif.wordpress.com/2011/09/28/view-pdf-structure-using-adobe-acrobat-or-a-free-tool-called-pdfxplorer/
Extracting text from PDF is a hard problem because PDF has such a layout-oriented structure. You can see the docs and source code of my barely-successful attempt on CPAN (my implementation is in Perl). The PDF data structure is very cool and well designed, but it's easier to write than read.
One way to get some clues is to create a PDF file consisting of a blank page. I have CutePDF Writer on my computer, and made a blank Wordpad document of one page. Printed to a .pdf file, and then opened the .pdf file using Notepad.
Next, use a copy of this file and eliminate lines or blocks of text that might be of interest, then reload in Acrobat Reader. You'd be surprised at how little information is needed to make a working one-page PDF document.
I'm trying to make up a spreadsheet to create a PDF form from code.
You need the PDF Reference manual to start reading about the details and structure of PDF files. I suggest to start with version 1.7.
On windows I used a free tool PDF Analyzer to see the internal structure of PDF files.
This will help in your understanding when reading the reference manual.
(I'm affiliated with PDF Analyzer, no intention to promote)
To extract text from a PDF, try this on Linux, BSD, etc. machine or use Cygwin if on Windows:
pdfinfo -layout some_pdf_file.pdf
A plain text file named some_pdf_file.txt is created. The simpler the PDF file layout, the more straightforward the .txt file output will be.
Hexadecimal characters are frequently present in the .txt file output and will look strange in text editors. These hexadecimal characters usually represent curly single and double quotes, bullet points, hyphens, etc. in the PDF.
To see the context where the hexadecimal characters appear, run this grep command, and keep the original PDF handy to see what character the codes represent in the PDF:
grep -a --color=always "\\\\[0-9][0-9][0-9]" some_pdf_file.txt
This will provide a unique list of the different octal codes in the document:
grep -ao "\\\\[0-9][0-9][0-9]" some_pdf_file.txt|sort|uniq
To convert these hexadecimal characters to ASCII equivalents, a combination of grep, sed, and bc can be used, I'll post the procedure to do that soon.