How to fix square boxes in a PDF?

While going through my PDF about regular expressions, I see that in many places some characters are replaced by square boxes, which seem to stand for some ASCII code.
Is there any way I can fix this?
I have checked these links:
http://www.tableausoftware.com/support/knowledge-base/square-boxes
http://acrobatusers.com/tutorials/text-matching-regular-expressions
and others, but did not find any solution. Attached is how the square boxes look.

As stema said, this has nothing to do with regular expressions.
Neither is it about some "pdf escape sequences", as PDF uses binary-safe text encodings.
These square blocks are usually shown in place of characters that don't have a representation in the chosen font. Often the typesetting software replaces quotes or other characters with a 'nicer' Unicode alternative, but the font doesn't contain those characters.
You could try to copy/paste the text from the PDF into some other document and replace the font, or even use a PDF editing tool (Enfocus PitStop is one of the most popular; it's cheap but not free) to replace the font with a more complete one.
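To narrow down which characters are likely to be missing from the font, one option is to paste the copied text into a small script that lists everything outside plain ASCII. This is only a sketch: the function name `find_suspects` is made up for illustration, and being non-ASCII does not prove that a glyph is missing; it merely flags the kind of 'nicer' typographic replacements mentioned above.

```python
import unicodedata


def find_suspects(text):
    """Return (character, Unicode name) pairs for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in seen]


if __name__ == "__main__":
    # Sample text with typographic quotes, as a word processor might produce.
    sample = "It\u2019s a \u201cquoted\u201d word"
    for ch, name in find_suspects(sample):
        print(f"U+{ord(ch):04X} {name}")
```

Any character reported here is a candidate to check against the font used in the PDF.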

First of all, this has nothing to do with regex, except that the document you are writing is about regular expressions.
I assume the sequence that is replaced by a square is \s, isn't it?
I think the problem here is that some regular expression shortcuts are interpreted as escape sequences in the PDF creation process and are therefore not printed literally.
You don't say how you create your PDF, but I would assume that it will be OK if you escape the backslashes that you want to print literally.
So when you want to see a \s in the PDF, type \\s in your source format. (If somewhere you have an escaped backslash you want to print, like \\, then write \\\\.)
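If the source text is generated by a program, the doubling described here can be done mechanically. The following is a minimal sketch under the assumption that the PDF toolchain really does treat a backslash as an escape character, which the question does not confirm; `escape_backslashes` is a hypothetical helper name.

```python
def escape_backslashes(source):
    """Double every backslash so that an escape-processing step prints it literally."""
    return source.replace("\\", "\\\\")


if __name__ == "__main__":
    # Raw strings keep the example readable: \s becomes \\s, and \\ becomes \\\\.
    print(escape_backslashes(r"\s"))
    print(escape_backslashes(r"\\"))
```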

Javier's answer is nearly complete. But let me add this:
There is a small chance that you can get Acrobat Reader to display the missing characters with a "substitute" font, instead of square boxes, by toggling a certain setting in its application preferences.
IIRC, the setting is called 'Use local fonts'. You can usually find it in the Page display section of the preferences settings, but over the different releases Adobe kept adding, removing or re-locating different settings...
Background info: If you have NOT enabled Use local fonts, then you require the Reader to only use the PDF-embedded fonts for displaying all text. In case the font is embedded, but misses some required glyphs, enabling said setting may find the required font on your system to render the text, or the Reader may use its built-in Multiple Master fonts which will try to fake the look of the original glyph, more or less....

Copy and paste the text into another Word document.
Select the text that contains the squares.
Press Control + Space.
Choose the font that you want for your text again.
Save the Word document as a PDF again.
Voilà, it is done!

Related

qpdf - replace text in existing PDF file

This is the first time I'm working with PDFs on this level, so please be patient with
my noob question. I understand the logical and physical structure of a PDF file
on a basic level.
I have a PDF that contains a dummy ID that needs to be replaced. To check if there
is a way to do this, I used qpdf to expand the PDF using
qpdf --qdf --object-streams=disable orig.pdf expanded.pdf
Using a hex editor I located the dummy ID in expanded.pdf and changed the value by
simply swapping two digits
<001800180017> Tj => <001700170018> Tj
and saved it. Opening expanded.pdf in Acrobat didn't show the modification. The original
ID 443 is still rendered, but searching for "443" doesn't find it. When searching for
"334", the modified content, I get the rendered original ID 443 highlighted.
The PDF consists of text and vector graphics. When I insert additional digits (which obviously
invalidates the offsets in the xref), I get an error message regarding a missing font and
all digits are shown as dots but the vector graphic is still in place. This seems to indicate
that the ID is not part of the graphic.
What did I miss?
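For reference, the hex-string operand of `Tj` shown above can be split into character codes with a few lines of Python. This is a sketch that assumes two bytes per code, as the four-hex-digit groups suggest; the mapping from codes to digits is not taken from the font but merely inferred from the statement that the string renders as 443, so treat `INFERRED_DIGITS` as a guess.

```python
def decode_hex_string(operand, code_bytes=2):
    """Split a PDF hex string such as <001800180017> into integer character codes."""
    raw = bytes.fromhex(operand.strip().strip("<>"))
    return [
        int.from_bytes(raw[i:i + code_bytes], "big")
        for i in range(0, len(raw), code_bytes)
    ]


# Assumption: inferred from "<001800180017>" being shown as "443" in the question.
INFERRED_DIGITS = {0x0017: "3", 0x0018: "4"}

if __name__ == "__main__":
    for operand in ("<001800180017>", "<001700170018>"):
        codes = decode_hex_string(operand)
        print(operand, codes, "".join(INFERRED_DIGITS.get(c, "?") for c in codes))
```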
EDIT 1:
After mkl's comment, I did a deeper analysis of my PDF and found that, besides the obvious graphic content, all text was rendered by a series of m/l/c commands followed by a BT/ET section. The stroke and non-stroke colors were both 0,0,0 in the BT/ET section.
Is this because of the used embedded non-standard font?
Are PDFs with embedded fonts usually done this way? A graphics part for the visual representation and a transparent (hidden) text part just to get searching and highlighting capabilities?
Looking back, I wonder what I did to get the dots when I first modified the
content. It seems impossible and I can't reproduce it either.
Thanks
Tom
First off, the following is merely guesswork as you could not share the pdf in question. Educated guesswork but guesswork nonetheless.
You report that you changed the value by simply swapping two digits in the text drawing instruction argument and now can successfully search for the value with swapped digits but that Acrobat didn't show the modification.
Furthermore you observed that all text was rendered by a series of m/l/c commands followed by a BT/ET section.
The main situation in which one observes text being rendered as arbitrary vector graphics (a series of m/l/c commands), is in pdfs in which the producer didn't want text extraction to be possible and replaced text drawing instructions by arbitrary vector graphics instructions.
This apparently is not the case in your pdf as the text drawing instructions are not replaced but merely supplemented by the vector graphics ones.
Supposing that this construct is used for a reason and not by accident, I can only assume that the pdf producer was not willing or allowed to embed the font in question but wanted the specific font appearance to be displayed without having to count on the font being installed on the computer the pdf is viewed on.
Thus, the text appearance is drawn using arbitrary vector graphics instructions and the following text drawing instructions actually draw nothing but merely make the text searchable and extractable. This way there is no need to embed the apparent font face as font program. (Text drawing instructions can be made to draw nothing either by using a font with all blank glyphs or by using the text rendering mode "invisible".)
If this assumption turns out to be correct, your task to replace the dummy id requires not merely editing the arguments of the text drawing instructions but also replacing the arbitrary vector graphics instructions showing the dummy id appearance by other instructions showing the actual id.
If you happen to have the font in question and are willing and able to embed it, you can actually replace the arbitrary vector graphics instructions by text drawing instructions using the font. Otherwise be prepared to also draw the actual id as arbitrary vector graphics.
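One way to test part of this guess is to look for the text rendering mode operator `Tr` in the expanded content stream: an operand of 3 means the text is neither filled nor stroked, i.e. invisible. The sketch below assumes the content stream has already been uncompressed (for example with the qpdf command from the question) and read into a string; the sample stream in it is made up for illustration. If no `Tr` operator is found, the blank-glyph font explanation remains possible.

```python
import re


def text_render_modes(content):
    """Return the operand of every Tr operator found in an uncompressed content stream."""
    return [int(m.group(1)) for m in re.finditer(r"(?<![\w.])(\d)\s+Tr\b", content)]


if __name__ == "__main__":
    # Hypothetical content stream fragment, not taken from the PDF in the question.
    sample = "BT\n/F1 12 Tf\n3 Tr\n<001800180017> Tj\nET\n"
    modes = text_render_modes(sample)
    print("render modes:", modes, "| invisible text present:", 3 in modes)
```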

How to export text document containing astral Unicode characters to PDF

I regularly create documents that need Unicode characters above U+FFFF. Unfortunately, OpenOffice and LibreOffice are both unable to correctly export these characters when creating a PDF. The actual data gets mangled by a completely asinine algorithm, while the display just consists of various overlapping question mark boxes.
This is not a font issue. I embed all used fonts in the PDF and all characters below U+FFFF work perfectly fine.
Until now I have been working around this issue by mapping the glyphs I need to a custom PUA font. This solves the display problems, but obviously makes the actual content of the text unsearchable and quite fragile. I haven’t been able to find any settings that might affect the handling of Unicode characters in PDF.
Therefore I have three questions:
Is there a way to make OpenOffice/LibreOffice handle astral characters correctly on PDF export?
If not, is there an external tool that can convert .odt files to PDF while preserving astral characters?
If not, is there another good rich-text editor using a different file format that can deal with astral characters in PDFs?
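For anyone trying to reproduce the problem, it helps to know exactly which characters in a document are astral and how they should look in UTF-16, where each of them needs a surrogate pair. The following sketch only inspects text; it makes no claim about what OpenOffice or LibreOffice do internally, and the helper names are made up for illustration.

```python
def astral_characters(text):
    """Return the characters of text whose code point lies above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def utf16_units(ch):
    """Return the UTF-16 code units of ch as four-digit hex strings."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


if __name__ == "__main__":
    sample = "a\U0001F600b"
    for ch in astral_characters(sample):
        print(f"U+{ord(ch):X} -> {' '.join(utf16_units(ch))}")
```

A correctly exported character above U+FFFF should appear as one such high/low surrogate pair wherever the PDF stores the text as UTF-16.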

How to substitute all "\t" (tab characters) with white space in a PDF

Hello, I am trying to convert a PDF book about programming to MOBI format with Calibre.
The problem I am facing is that the code blocks inside the converted version completely lose indentation.
With a regular expression I managed to correctly indent the lines that were indented using white spaces. I did so by transforming every two white spaces into two non-breaking spaces.
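Expressed in Python, a substitution of this kind might look like the sketch below. It is not the exact expression used in Calibre, whose search-and-replace syntax may differ, and it is restricted to leading indentation as an assumption so that spaces inside a line stay untouched; `preserve_indent` is a hypothetical name.

```python
import re

NBSP = "\u00a0"  # non-breaking space


def preserve_indent(text):
    """Replace each pair of leading spaces on a line with two non-breaking spaces."""
    return re.sub(r"(?m)^(?: {2})+", lambda m: NBSP * len(m.group(0)), text)


if __name__ == "__main__":
    code = "def f():\n  if True:\n    return 1"
    # repr() makes the non-breaking spaces visible as \xa0.
    print(repr(preserve_indent(code)))
```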
Some of the code blocks unfortunately are indented using the tab character, so the regular expression is not working in these cases.
I came to realize that during the conversion from PDF to MOBI there is an intermediate step in which the PDF is converted to HTML, and that is where the tab information is lost, because no special tag is generated to carry this information.
So I think the best solution is to edit the PDF itself and replace all the tab characters (\t) with two white spaces (\s\s). This way the regular expression I mentioned before will work for all the code blocks and the code will be indented properly.
But I have no idea which software offers this functionality of substituting PDF elements.
I doubt that the 'tabs' are contained in the PDF as tabs. The 'tab' character (0x09 in ASCII) has no special significance in PDF, and in particular it does not move the current point; it simply draws a glyph. As a result, if you do (A\tB), what you will see when the PDF is rendered is 'AB', or 'A*B' where the * is some other character you didn't expect (often a square).
So you would probably actually have to convert current point movement operators into white space drawing. There's no realistic way that can be automated, since no tool can tell where a movement was a 'tab' and where it was a reposition.
So you will need to do it manually.
The challenge here is that the page content stream is likely to be compressed, so the first thing you will have to do is decompress the PDF. There are a number of tools which will do this for you; MuPDF is one, and I think pdftk is another.
Then you will need to locate the position where you want to insert space, this could be challenging, as the font may be re-encoded to something other than ASCII so it may be hard to identify the correct position. Once you've done that, you can insert the space(s) you want into the text strings, again bearing in mind that the font in use may be re-encoded, and subset. This means that a space may not be 0x20 and indeed the font may not even contain a space glyph. And of course you need to remove the operations to reposition the current point.
Finally, after you've modified the contents, you need to remember that PDF is a binary format, and the xref table contains the position of every element in the file. If you've edited the file, it's likely that you will have altered the length of one or more elements, which will change the offset of any following elements, so you will need to recalculate those and update the xref table.
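To give an idea of what recalculating the offsets involves, here is a rough Python sketch. It assumes an uncompressed file with a classic cross-reference table and no object streams, and it finds objects by naively searching for `N G obj` at the start of a line, which can be fooled by binary stream data that happens to contain the same bytes. The sample bytes are made up for illustration.

```python
import re


def object_offsets(pdf_bytes):
    """Map each object number to the byte offset of its 'N G obj' header."""
    return {
        int(m.group(1)): m.start()
        for m in re.finditer(rb"(?m)^(\d+) (\d+) obj\b", pdf_bytes)
    }


def xref_entry(offset, generation=0):
    """Format one in-use cross-reference entry (exactly 20 bytes including the EOL)."""
    return f"{offset:010d} {generation:05d} n \n"


if __name__ == "__main__":
    sample = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n2 0 obj\n<<>>\nendobj\n"
    for number, offset in sorted(object_offsets(sample).items()):
        print(number, xref_entry(offset), end="")
```

The `startxref` value at the end of the file would need the same kind of update, since it holds the byte offset of the xref table itself.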
I suspect you are going to find it easier to modify the conversion from PDF to HTML, or modify the HTML, than to try and alter the PDF file.

How to "mask" certain text in a PDF document

I have a PDF document, and I want to mask certain text blocks. The reason I want to do this is that I don't want this text to be indexed, nor do I want this information to be easily accessible by selecting and copying the text block.
What should be the right way to do this?
I guess turning the text into a raster image would be a bad idea, and I don't know if there is a tool that can give only certain parts of the text special privileges.
You will need a program that can convert a font into a series of shapes.
Illustrator may have the functionality you want: see here and here.

Formatting plain text output for printing?

I have a program that outputs a report into plain text. The report must be plain text for it to load into a third party program. The report also needs to be printable.
When dealing with plain text, what limits should I set on line size and number of lines on a page to get it to print reasonably?
It definitely depends on the font you use when printing, and unless you have control over that you can't guarantee it will print nicely. For example, in Word 2007, creating a blank document and setting the font to Courier New 10pt only fits 77 characters per line and 28 lines per page. Changing the margins and line spacing will modify that. However if you used that and they tried to print from Wordpad it wouldn't work because the default with Courier New 10pt only fits 72 characters per line. In either case, the standard 80 characters doesn't work. Those defaults aren't even global defaults.
The best you can probably do is pick a size and provide instructions on printing the report with several common editors so it will look acceptable. Specify the font, margin, line spacing, etc.
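Once a size has been picked, the report generator can enforce it. The sketch below wraps and paginates plain text using the most conservative figures mentioned above (72 characters per line, 28 lines per page) as defaults; these are assumptions that depend on font, margins and editor, and the form feed between pages is likewise an assumption about what the printing program honours.

```python
import textwrap


def paginate(text, width=72, lines_per_page=28):
    """Wrap text to width and split the resulting lines into pages."""
    lines = []
    for paragraph in text.splitlines():
        # An empty paragraph wraps to no lines; keep it as a blank line.
        lines.extend(textwrap.wrap(paragraph, width=width) or [""])
    return [
        lines[i:i + lines_per_page]
        for i in range(0, len(lines), lines_per_page)
    ]


if __name__ == "__main__":
    report = "word " * 1000
    pages = paginate(report)
    # Join pages with a form feed character so that each starts on a new sheet.
    output = "\f".join("\n".join(page) for page in pages)
    print(len(pages), "pages")
```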
Have you considered other options, like creating two files? One could be plain text for import into the other program. The other could be a format better suited for printing.
It's going to depend on your printing font. You want to aim for a fixed-width font so that the output is consistent; 80 columns is generally safe, I think.
edit: here is a quick guide I googled - http://dsl.org/cookbook/cookbook_17.html
If you have control over the output format, consider a lightweight markup language such as reStructuredText, AsciiDoc or Markdown.
This way you can pipe the plain-text format into a converter that produces PostScript, PDF or HTML, which you can then print. This also mostly removes the need to consider the line width for the sake of your printer; the converter will do this for you.