Formatting plain text output for printing? - formatting

I have a program that outputs a report into plain text. The report must be plain text for it to load into a third party program. The report also needs to be printable.
When dealing with plain text, what limits should I set on line size and number of lines on a page to get it to print reasonably?

It definitely depends on the font you use when printing, and unless you have control over that you can't guarantee it will print nicely. For example, in Word 2007, creating a blank document and setting the font to Courier New 10pt fits only 77 characters per line and 28 lines per page. Changing the margins and line spacing will modify that. However, if you sized the report for that and someone tried to print it from WordPad, it wouldn't work, because the WordPad default with Courier New 10pt fits only 72 characters per line. In either case, the standard 80 characters doesn't fit. And those defaults aren't even global defaults.
The best you can probably do is pick a size and provide instructions on printing the report with several common editors so it will look acceptable. Specify the font, margin, line spacing, etc.
Have you considered other options, like creating two files? One could be plain text for import into the other program. The other could be a format better suited for printing.
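Whatever size you pick, the wrapping and page breaks can be applied when the report is generated. A minimal Python sketch, assuming 72 columns and 28 lines per page (the Courier New 10pt figures mentioned above, not universal values) and assuming the importing program tolerates a form feed between pages. The function name paginate is made up for illustration:

```python
import textwrap


def paginate(text, width=72, lines_per_page=28):
    """Wrap text to a fixed column width and split it into pages.

    width and lines_per_page are assumptions; adjust them to the font,
    margins and line spacing you tell users to print with.
    """
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty line, so keep it as a blank line
        lines.extend(textwrap.wrap(paragraph, width=width) or [""])

    pages = [
        lines[i:i + lines_per_page]
        for i in range(0, len(lines), lines_per_page)
    ]
    # A form feed (\f) asks most printers and viewers to start a new page
    return "\f".join("\n".join(page) for page in pages)
```

If the third party program chokes on the form feed character, the same page list could be joined with blank lines instead, at the cost of depending on the printer's page length.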

It's going to depend on your printing font. Aim for a fixed-width font so that the layout is consistent; 80 columns is generally safe, I think.
Edit: here is a quick guide I found: http://dsl.org/cookbook/cookbook_17.html

If you have control over the output format, consider a lightweight markup language such as reStructuredText, AsciiDoc, Markdown, etc.
This way you can pipe the plain-text format into a converter that will produce PostScript, PDF or HTML, which you can then print. This also mostly removes the need to consider the line width for the sake of your printer; the converter will handle that for you.

Related

PDF programmatically set font without being forced to set font size

According to the PDF syntax given by Adobe here it seems that there is no possibility to set the actual font, without setting at the same time also the font size.
In my situation, the font size has already been specified and set earlier, together with a different font. I want to keep the size and alter only the font.
For example, the operator looks like this:
/F1 12 Tf
where F1 represents the font and 12 is the font size.
Did I miss something, or is there a nice workaround for it?
Please note, that I have no access to the current font size and don't want to alter it.
Generally it requires a PDF editor to do font replacement, often with mixed results, as there is no simple means to substitute one font tag for another. I crafted a PDF to show how simply swapping /F1, /F2, /F3 and /F4 may corrupt the output.
Lines 1 and 2 are different styles of the Arial font and lines 3 and 4 are Consolas. As I cycle the font tags, I might be able to replace one style with another, but here only once in each font family, and clearly the line length will change.
So IF you know you want to swap style later, it is essential to use proportional / fixed pitch static fonts.
IF you plan ahead, it is possible to embed fonts for swapping style, but generally not family, OR you could, as many do, set a generic font and rely on the system replacing it with local default fonts.
Otherwise you need to read the text in one font and rewrite that text block in a different font, bearing in mind the need to reflow the line lengths or adjust sizes to suit, hence many a font is found with non-uniform units.
For fairly simple font substitution I would normally resort to PDF-XChange Editor or FlexiPDF.
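If you do end up editing a content stream by hand, the font resource name in a Tf operator can be swapped while leaving the size operand untouched. A rough Python sketch, assuming the content stream is already decompressed and that the new name (here /F2) is already defined in the page's font resources; swap_font is just an illustrative helper. It is a naive pattern match, so it would also rewrite a matching sequence that happens to sit inside a text string, and it does nothing about the width and encoding problems described above:

```python
import re

# Matches "/<name> <number> Tf"; the number (font size) is captured untouched.
TF_PATTERN = re.compile(
    rb"/(?P<font>[^\s/\[\]()<>]+)(?P<rest>\s+[-+]?\d*\.?\d+\s+Tf\b)"
)


def swap_font(stream: bytes, old: bytes, new: bytes) -> bytes:
    """Replace font resource name `old` with `new` in every Tf operator."""
    def repl(match):
        if match.group("font") != old:
            return match.group(0)  # a different font, leave it alone
        return b"/" + new + match.group("rest")

    return TF_PATTERN.sub(repl, stream)


print(swap_font(b"BT /F1 12 Tf (Hello) Tj ET", b"F1", b"F2"))
```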

How to convert italic font to normal font in pdf using some library?

Is there any way to convert italic or bold fonts in my PDF to a normal font using some library like ImageMagick or Ghostscript?
Basically the answer is 'no' though there are several levels of caveat in there.
The most common scenario for a PDF file is that it contains an embedded font, and that font is subset. In this case the font will use a custom Encoding, so that when you see 'Hello' on your monitor, the actual character codes might be 'Axtte' or similar gibberish. If the font also contains a ToUnicode table you could, technically, create an embedded subset of the regular font from the same family as the bold or italic and embed that, and it would work. This would be an immense amount of work.
If the font isn't subset then it may not contain a custom Encoding, which would make that task easier, because you wouldn't have to re-encode the replacement.
If the font isn't embedded, then you need only change the font name in the Font object, because the PDF consumer will have to find a substitute anyway.
Note that, because PDF is a binary format, with an index (xref) containing the offset of every object in the file, any changes will mean that the xref table has to be reconstructed, again a considerable task.
I'm not aware of any tools which would do any of this for you automatically, you'd have to write your own, though some things could be done automatically. MuPDF for example will 'fix' a PDF file which has an incorrect xref table for you.
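To give an idea of what "reconstructing the xref" involves, here is a simplified Python sketch that scans a file for "N G obj" markers and emits a classic cross-reference table with their byte offsets. It assumes a simple PDF with objects numbered contiguously from 1, each starting on its own line ending in a line feed, with no object streams, cross-reference streams or incremental updates. A real file would also need the trailer and startxref rewritten, and the scan could be fooled by binary stream data. build_xref is a hypothetical helper, not part of any library:

```python
import re

# "12 0 obj" at the start of a line: object number, generation number
OBJ_RE = re.compile(rb"(?m)^(\d+)\s+(\d+)\s+obj\b")


def build_xref(pdf_bytes: bytes) -> bytes:
    """Build a classic xref table from the object offsets found in the file."""
    found = {
        int(m.group(1)): (m.start(), int(m.group(2)))
        for m in OBJ_RE.finditer(pdf_bytes)
    }
    size = max(found, default=0) + 1

    # Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
    # 'n' (in use) or 'f' (free), then a two-byte end of line.
    entries = [b"0000000000 65535 f \n"]  # object 0 is always free
    for number in range(1, size):
        offset, generation = found[number]  # KeyError if numbering has gaps
        entries.append(b"%010d %05d n \n" % (offset, generation))

    return b"xref\n" + b"0 %d\n" % size + b"".join(entries)
```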
And even after all that, the likelihood is that the spacing would be different for the italic or bold font compared to the regular font anyway, and would look peculiar if you replaced them with a regular font.
So, fundamentally, no.
In low-level PDF you can apply some rendering flags in front of a text stream, such as the text rendering mode operator Tr. For instance, you can enable drawing of the text outline and increase the outline width with the command sequence 0.4 w 2 Tr, which will cause normal text to look more "bold" (there are other, better ways to accomplish this using the font descriptor dictionary). One can also employ this tactic to slim down bold text using a clipped thicker outline, but this may not be ideal.
As for italic, most fonts contain a metric indicating their italic angle, and you can use this to add a faux italic by applying a shear transformation to the CTM with the cm operator. Once again, this works better for adding an italic shear, but may also have some success in removing it.
See the PDF Reference.
This will require a library with lower level PDF building and you would have to do it manually, but it is possible technically.
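As a sketch of the shear idea: the cm operator takes six numbers a b c d e f, mapping x' = a*x + c*y + e and y' = b*x + d*y + f, so a horizontal shear puts tan(angle) in the c slot. The Python helper below (shear_cm is an invented name) only builds the operator string. It assumes you insert it yourself before the text block, and it shears around a baseline you supply so the text does not slide sideways. Since /ItalicAngle is negative for fonts that slope to the right, passing the font's own /ItalicAngle should roughly cancel the slant, while passing a positive value such as 12 adds a faux italic:

```python
import math


def shear_cm(slant_deg: float, baseline_y: float = 0.0) -> str:
    """Return a cm operator that shears horizontally by slant_deg.

    Positive slant_deg leans glyph tops to the right. baseline_y is the
    y coordinate that should stay fixed (an assumption about your layout).
    """
    c = math.tan(math.radians(slant_deg))
    e = 0.0 - c * baseline_y  # keeps points on the baseline where they were
    return f"1 0 {c:.4f} 1 {e:.4f} 0 cm"


print(shear_cm(12.0))                      # add a faux italic
print(shear_cm(-12.0, baseline_y=700.0))   # try to undo a 12 degree slant
```

Wrapping the sheared text in q ... Q keeps the changed matrix from leaking into the rest of the page.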

extracting italic word from PDF using iText [duplicate]

I have an application, that extracts headings out of pdf files. The documents, that the application is supposed to work with, all have more or less coherent structure and formatting, in fact, telling if a text chunk is bold or not, is very important. Recently I came across a bunch of files, where some chunks visually appear bold, but do not have "bold" piece in string representation of font. The following SO thread how can i get text formatting with iTextSharp helped me to understand, that there is one more way of making text appear bold. However in my case calling GetTextRenderMode() does not help either, as it returns 0 as if it were normal text. So are there any other ways of making text appear bold, and is it possible to detect it using iTextSharp ?
You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.
This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:
We see that the font is of subtype /TrueType, we see that the /ItalicAngle is 0, and... we see that the 3rd bit of the /Flags is set. Let's check the PDF reference to find out what this tells us:
I quote:
The font contains glyphs outside the Adobe standard Latin character set.
The glyphs look bold, because the glyphs are drawn in a way that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the /FontFile2 stream.
In short: iTextSharp doesn't have any indications that the font is bold.
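For reference, the /Flags value is just an integer bit field, and none of its bits means "bold". The Python sketch below decodes it using the bit positions from the font descriptor flags table in the PDF Reference (bit 1 is the lowest-order bit). How you obtain the integer from your PDF library is left out, and decode_flags is an illustrative name only. The closest thing is ForceBold (bit 19), which is a rendering hint for small sizes rather than a statement that the font is bold:

```python
# Bit positions as listed in the PDF Reference font descriptor flags table
FONT_FLAGS = {
    1: "FixedPitch",
    2: "Serif",
    3: "Symbolic",
    4: "Script",
    6: "Nonsymbolic",
    7: "Italic",
    17: "AllCap",
    18: "SmallCap",
    19: "ForceBold",
}


def decode_flags(flags):
    """Return the names of the flag bits set in a /Flags integer."""
    return [
        name
        for bit, name in FONT_FLAGS.items()
        if flags & (1 << (bit - 1))
    ]


print(decode_flags(4))  # only the 3rd bit set, as in the font above
```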

How to substitute all "\t" (tab characters) with white space in a PDF

Hello, I am trying to convert a PDF book about programming to mobi format with Calibre.
The problem I am facing is that the code blocks inside the converted version completely lose indentation.
I managed, with a regular expression, to correctly indent the lines that were indented using white spaces. I did so by transforming every two white spaces into two non-breaking spaces.
Some of the code blocks unfortunately are indented using the tab character, so the regular expression is not working in these cases.
I came to realize that during the conversion from PDF to mobi there is an intermediate step in which the PDF is converted to HTML, and that is where the tab information is lost, because no special tag is generated to carry this information.
So I think the best solution is to edit the PDF itself and replace all the tab characters (\t) with two white spaces (\s\s). This way the regular expression I mentioned before will work for all the code blocks and the code will be indented properly.
But I have no idea which software has this functionality of substituting PDF elements.
I doubt that the 'tabs' are contained in the PDF as tabs. The 'tab' character (0x09 in ASCII) has no special significance in PDF, and in particular it does not move the current point, it simply draws a glyph. As a result, if you do (A\tB) what you will see when the PDF is rendered is 'AB'. Or 'A*B' where the * is some other character you didn't expect (often a square).
So you would probably actually have to convert current point movement operators into white space drawing. There's no realistic way that can be automated, since no tool can tell where a movement was a 'tab' and where it was a reposition.
So you will need to do it manually.
The challenge here is that the page content stream is likely to be compressed, so the first thing you will have to do is decompress the PDF. There are a number of tools which will do this for you, MuPDF is one, I think pdftk is another.
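If you just want to peek at the content streams without installing anything, a crude Python sketch like the one below can inflate the Flate-compressed ones. It assumes the streams use plain /FlateDecode with no predictor or additional filters and that the file is not encrypted. It finds streams by pattern matching rather than by parsing the objects, and it only reads the data, it does not write a modified PDF back. inflate_streams is an invented name:

```python
import re
import zlib

# Data between "stream" and "endstream"; the lookbehind skips "endstream"
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the decompressed bytes of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj ignores the end-of-line bytes left after the data
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data (image, font, other filter), skip it
    return results
```

Reading the inflated text operators this way makes it easier to see whether the indentation really comes from repositioning operators rather than from characters in the strings.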
Then you will need to locate the position where you want to insert space, this could be challenging, as the font may be re-encoded to something other than ASCII so it may be hard to identify the correct position. Once you've done that, you can insert the space(s) you want into the text strings, again bearing in mind that the font in use may be re-encoded, and subset. This means that a space may not be 0x20 and indeed the font may not even contain a space glyph. And of course you need to remove the operations to reposition the current point.
Finally, after you've modified the contents, you need to remember that PDF is a binary format, and the xref table contains the position of every element in the file. If you've edited the file, it's likely that you will have altered the length of one or more elements, which will change the offset of any following elements, so you will need to recalculate those and update the xref table.
I suspect you are going to find it easier to modify the conversion from PDF to HTML, or modify the HTML, than to try and alter the PDF file.