How to substitude all "\t" (tab characters) with white space in a PDF - pdf

Hello i am trying to convert a pdf book about programming to mobi format with Calibre.
The problem I am facing is that the code blocks inside the converted version completely lose indentation.
I managed with a regular expression to correctly indent the lines that where indented using white spaces. I did so transforming every two white spaces to two non-breaking-spaces.
Some of the code blocks unfortunately are indented using the tab character, so the regular expression is not working in these cases.
I came to realize that during the conversion from pdf to mobi there is an intermediate step in which the pdf is converted to hmtl and there is when the tab information is lost because no special tag is being generated to carry this information.
So i think the best solution is to edit the very pdf itself and replace all the tab characters(\t) to two white spaces (\s\s). This way the regular expression i mentioned before will work for all the code block references and the code will be indented properly.
but i have no idea which software to use that has this functionality of substituting pdf elements.

I doubt that the 'tabs' are contained in the PDF as tabs. The 'tab' character (0x04 in ASCII) has no special significance in PDF, and in particular it does not move the current point, it simply draws a glyph. As a result, if you do (A\tB) what you will see when the PDF is rendered is 'AB'. Or 'A*B' where the * is some other character you didn't expect (often a square)
So you would probably actually have to convert current point movement operators into white space drawing There's no realistic way that can be automated, since no tool can tell where a movement was a 'tab' and where it was a reposition.
So you will need to do it manually.
The challenge here is that the page content stream is likely to be compressed, so the first thing you will have to do is decompress the PDF. There are a number of tools which will do this for you, MuPDF is one, I think pdftk is another.
Then you will need to locate the position where you want to insert space, this could be challenging, as the font may be re-encoded to something other than ASCII so it may be hard to identify the correct position. Once you've done that, you can insert the space(s) you want into the text strings, again bearing in mind that the font in use may be re-encoded, and subset. This means that a space may not be 0x20 and indeed the font may not even contain a space glyph. And of course you need to remove the operations to reposition the current point.
Finally, after you've modified the contents, you need to remember that PDF is a binary format, and the xref table contains the position of every element in the file. If you've edited the file its likely that you will have altered the length of one or more elements, which will change the offset of any following elements, so you will need to recalculate those and update the xref table.
I suspect you are going to find it easier to modify the conversion from PDF to HTML, or modify the HTML, than to try and alter the PDF file.

Related

qpdf - replace text in existing PDF file

this is the first I'm working with PDFs on this level. So please be patient with
my noob question. I understand the logical and physical structure of an PDF file
on a basic level.
I have an PDF that contains a dummy ID that needs to be replaced. To check, if there
is way to do this, I used qpdf to expand the PDF using
qpdf --qdf --object-streams=disable orig.pdf expanded.pdf
Using a hex editor I located the dummy ID in expanded.pdf and changed the value by
simply swapping two digits
<001800180017> Tj => <001700170018> Tj
and saved it. Opening expanded.pdf in Acrobat didn't show the modification. The original
ID 443 is still rendered, but searching for "443" doesn't find it. When searching for
"334", the modified content, I get the rendered original ID 443 highlighted.
The PDF consist of text and vector graphic. When I insert additional digits (which obviously
invalidates the offsets in the xref), I get an error message regarding a missing font and
all digits are shown as dots but the vector graphic is still in place. This seems to indicate
that the ID is not part of the graphic.
What did I miss?
EDIT 1:
After mkl's comment, I did a deeper analysis of my PDF and found, that beside the obvious graphic content, all text was rendered by a series of m/l/c commands follwoed by a BT/ET section. Color for stroke and non-stroke was 0,0,0 for both in the BT/ET section.
Is this because of the used embedded non-standard font?
Are PDFs with embedded fonts usually done this way? A graphics part for the visual representation and a transparent (hidden) text part just to get searching and highlighting capabilities?
Looking back I wonder what I did to get the dots when I first modified the
content. I seems impossible and I can't reproduce it either.
Thanks
Tom
First off, the following is merely guesswork as you could not share the pdf in question. Educated guesswork but guesswork nonetheless.
You report that you changed the value by simply swapping two digits in the text drawing instruction argument and now can successfully search for the value with swapped digits but that Acrobat didn't show the modification.
Furthermore you observed that all text was rendered by a series of m/l/c commands followed by a BT/ET section.
The main situation in which one observes text being rendered as arbitrary vector graphics (a series of m/l/c commands), is in pdfs in which the producer didn't want text extraction to be possible and replaced text drawing instructions by arbitrary vector graphics instructions.
This apparently is not the case in your pdf as the text drawing instructions are not replaced but merely supplemented by the vector graphics ones.
Supposing that this construct is used for a reason and not by accident, I can only assume that the pdf producer was not willing or allowed to embed the font in question but wanted the specific font appearance to be displayed without having to count on the font being installed on the computer the pdf is viewed on.
Thus, the text appearance is drawn using arbitrary vector graphics instructions and the following text drawing instructions actually draw nothing but merely make the text searchable and extractable. This way there is no need to embed the apparent font face as font program. (Text drawing instructions can be made to draw nothing either by using a font with all blank glyphs or by using the text rendering mode "invisible".)
If this assumption turns out to be correct, your task to replace the dummy id requires not merely editing the arguments of the text drawing instructions but also replacing the arbitrary vector graphics instructions showing the dummy id appearance by other instructions showing the actual id.
If you happen to have the font in question and are willing and able to embed it, you can actually replace the arbitrary vector graphics instructions by text drawing instructions using the font. Otherwise be prepared to also draw the actual id as arbitrary vector graphics.

How to convert a multi-page PDF table to a spreadsheet format?

I have a huge PDF file with 300+ pages on which a big 10+ column table is spread. I am using Linux and would like to have a simple command line command which would convert this table to a text importable to a spreadsheet.
Currently I am using pdftotext -layout, and gives quite good results, other than every page is considered independently and column widths and positions change from page to page (due to different maximum column content width on each page), so I cannot simply import the resulting text file to a spreadsheet application and split it to columns by a fixed column width.
I have tried to crop every column on every page (their position is identical across the whole PDF file), but in the result the empty rows are merged together, so the rows with content will be shifted with respect to each other.
If pdftotext had an option to convert the file with a STRICT LAYOUT (not by column content width), that would help. Or if I could stack all pages in PDF file to a single page, that could also solve it.
What are the options to solve this problem?
You are misunderstanding the nature of the content of a PDF file. There are no tables in PDf, there is no metadata (generally) to describe the content as a table. The text you see on the page may not be laid out in the reading order.
For example the PDF file might contain a line of text drawn at the top of the page, then one at the bottom, then a paragraph in the middle before jumping back up to the top for a headline.
In addition there may be no spaces between two text fragments. Text is drawn at an absolute position on the page, so you can draw (for example ) cell A, then move the current point by say 1 cm, rthen draw cell B and so on. Since there's no 'space' characters between the two cells, a naive text extraction will, naturally, assume the two lines of text are coninuous.
The STRICT LAYOUT you want isn't impossible, but you can't do it with a simple text file, because the original layout isn't made up of simple text characters, sometimes the space between two characters, or two fragments of text is done by moving the current point before drawing the text.
Ghostscript's txtwrite device in its simplest mode attempts to replicate the layout by replacing the white space with actual space characters in a fixed pitch font. This 'might' be good enough for you, but it equally well might not. That's because it operates by defining the smallest distance used on the page as being one space character. All distances between text is then replaced by a number of space characters, as many as are required to make up the space. This can (and often does) result in very wide output files with a lot of white space.
Essentially what you seem to want isn't really possible, you can't take a rich format like PDF and replicate it, including the layout, with nothing more than text characters.

PDF editing directly then deleting edits still leaves pdf corrupted

My PDF looked fine until I edited it, and now it still appears to be corrupted even after I took out my edits. A file diff program is saying that the two files are the same, but only one is displaying the information.
To reproduce:
1) Open PDF and make sure there is stuff inside of it
2) Open PDF in a text editor and add text at the top
3) Open PDF normally and it is empty
4) delete text added in step 2
5) PDF is still corrupted despite having SAME file contents
This also happens if I literally copy and paste the code from a PDF into a different file and try to open that. It won't open.
Is there any way to be able to be able to add text to a PDF and have it not corrupt?
PDF is a binary format. Even if it looks quite text'ish, it is not text. In particular PDF files usually contain binary data streams, e.g. for images or embedded fonts or compressed arbitrary content. Furthermore, PDFs rely on PDF objects starting at offsets noted in a cross reference table or stream in the file.
Many text editors, though, do not only apply the changes you type in to a document but also do other stuff, like unifying line breaks (DOS CRLF or Unix LF or Max CR), replacing byte sequences they could not interpret by a special character (e.g. the Unicode REPLACEMENT CHARACTER) or dropping them altogether, etc.
The former (unifying line breaks) moves the data without updating the cross reference information, rendering it useless. If the bytes interpreted as line break characters were actually parts of binary stream data, the stream data also is damaged.
The latter (byte sequence replacement) usually damages contents of streams in the PDF with compressed data or other sensitive binary data beyond repair. Depending on the sequence length, this also moves data and so invalidates cross references.
Thus, using a text editor to edit a PDF usually is a sure way to break a PDF.
Is there any way to be able to be able to add text to a pdf and have it not corrupt?
Yes, using PDF aware software, e.g. Adobe Acrobat but there also are others. If you prefer a programming approach, use a good general purpose PDF library. There are such libraries for many programming platforms.
For a very few types of changes, one can also use a hex editor (only replacing some bytes, not inserting or removing anything), but you really should know what you are doing.

How Does a PDF Store Text

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the resulting document isn't OCR'ed in this particular scenario the way it would be if the original PDF document had been created from an image.
To get a bit more detailed, I am trying to understand how text extractors for PDFs work. My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the "image" of the document to a page or a printer, and that there was no actual text contained within the document itself. Subsequently, I was thinking that a text extractor might reverse-engineer such instructions to generate the text that the PDF would otherwise generate. I am not confident of this, though.
PDF contains several different types of objects; not only vectorial or raster drawing instructions. Text in in particular is represented by text elements. These include a string of characters that should be drawn at certain positions using a specific font.
Text extraction from PDFs can be a complicated affair because the file format is oriented for page layout. A text element may be an entire paragraph, or a single character. Even a single word may consist of several text elements if different typefaces are mixed. Also, the characters are not necessarily encoded in a standard encoding such as Unicode. They may be encoded in a way specific to a particular font.
If you are lucky enough to deal with Tagged PDF files such as PDF/A or PDF/UA, text extraction can be a lot easier because text spans are identified as such, and a mapping to Unicode characters is defined.
Wikipedia doesn't have the complete specification but does serve as an introduction: http://en.wikipedia.org/wiki/Portable_Document_Format#Text

How to change value of a textbox in a pdf

I have to make several certificates with the same design but different names. So I've tried to make an uncompressed pdf file with a place holder text and tried to change it with a text editor. For some reason it didn't work. I could only see a single letter of the replaced text.
When I try the same thing with an eps file, it works but since eps doesn't keep (AFAIK) page orientation, there is a chance that it something will be different with different names.
Does anyone know why this didn't work or how to change a text box in a pdf file (with sed)?
(I created the master pdf with Illustrator CS4)
Thank you
In general, editing PDFs in a text editor is a Bad Idea. PDFs depend on the byte offsets of various objects to not move.
If you KNOW your editor won't change the EOL bytes (or what it thinks are eol bytes), and you DO NOT change the length of the text entry's object as a whole, you're okay.
For example:
1 0 obj
<</Type/Annotation/Subtype/Widget/V(PlaceHolder Value)/T(Field Title)...>>
endobj
If your new value is longer than "placeholder value", you're screwed.
Most PDFs contain quite a bit of compressed binary data. Some of that data WILL be misinterpreted as EOL characters. Changing them will:
a: break your compressed stream
b: possibly change the byte offsets of the rest of the PDF.
When I hack on PDF files, I always use a hex editor.
Bottom Line: Don't mess with PDFs as a text stream. Mess with them as PDF files, using a PDF library. There's sure to be one capable of altering form field values in your language of choice.
You can also look into FDF and XFDF to see if they'll suit you better. Both file formats store field/value pairs and a reference to the form to use with those pairs. FDF uses PDF's syntax, while XFDF is an XML grammar. You can serve the [X]FDF to your end user and they will see the filled-in form.
WARNING: Unless the form is Reader Enabled (requires Acrobat (pro?)), they won't be able to save the version of the form they get after opening the [X]FDF, only view/print it. Of course they can save the [X]FDF, but many users might balk at this Strange New Format.