Can't copy from PDF file generated from UWP app - xaml

I'm trying to print from UWP app and following this link
While saving it as a pdf file, it's printing normally. I'm able to copy the letters as well. But when I'm pasting it somewhere else, it is printing something like this: "􀀷􀁋􀁌􀁖􀀃􀁌􀁖􀀃􀀷􀁈􀁖􀁗"
I've tried different fonts as well, but no help.
Here is the XAML I'm trying to print:
<Grid x:Name="PrintableArea" Background="White">
<StackPanel x:Name="TextContent">
<TextBlock TextAlignment="Center" FontFamily="Arial" FontWeight="Bold">
This is Test
</TextBlock >
</StackPanel>
</Grid>
How to fix it?

Whatever you are using to create the PDF is clearly unable to create the PDF file with a ToUnicode CMap.
PDF files usually only embed a subset of a font in order to keep the size down. This generally means that the Encoding applied to the font is non-standard (and it generally isn't ASCII anyway). So for example if you have the text "Hello World" then the character codes would be assigned so that "H" = 1, "e" = 2 and so on.
If you copy and paste that, then you get 1, 2, 3, 3, 4, 5, 6, 4, 7, 3, 8 which will appear as binary.
A PDF file may contain a ToUnicode CMap which maps the character code to Unicode code points, and a PDF viewer application can use that to copy the Unicode code points instead of the character codes, which permits sane copy/paste. But its optional. This is because the original design decisions around PDF were to create a portable viewer, the PDF file should look the same on all consumers, but the designers didn't have editing or copying in mind.

Related

Itextsharp: Export greek letters to PDF? [duplicate]

I have a problem when adding characters such as "Č" or "Ć" while generating a PDF. I'm mostly using paragraphs for inserting some static text into my PDF report. Here is some sample code I used:
var document = new Document();
document.Open();
Paragraph p1 = new Paragraph("Testing of letters Č,Ć,Š,Ž,Đ", new Font(Font.FontFamily.HELVETICA, 10));
document.Add(p1);
The output I get when the PDF file is generated, looks like this: "Testing of letters ,,Š,Ž,Đ"
For some reason iTextSharp doesn't seem to recognize these letters such as "Č" and "Ć".
THE PROBLEM:
First of all, you don't seem to be talking about Cyrillic characters, but about central and eastern European languages that use Latin script. Take a look at the difference between code page 1250 and code page 1251 to understand what I mean. [NOTE: I have updated the question so that it talks about Czech characters instead of Cyrillic.]
Second observation. You are writing code that contains special characters:
"Testing of letters Č,Ć,Š,Ž,Đ"
That is a bad practice. Code files are stored as plain text and can be saved using different encodings. An accidental switch from encoding (for instance: by uploading it to a versioning system that uses a different encoding), can seriously damage the content of your file.
You should write code that doesn't contain special characters, but that use a different notations. For instance:
"Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110"
This will also make sure that the content doesn't get altered when compiling the code using a compiler that expects a different encoding.
Your third mistake is that you assume that Helvetica is a font that knows how to draw these glyphs. That is a false assumption. You should use a font file such as Arial.ttf (or pick any other font that knows how to draw those glyphs).
Your fourth mistake is that you do not embed the font. Suppose that you use a font you have on your local machine and that is able to draw the special glyphs, then you will be able to read the text on your local machine. However, somebody who receives your file, but doesn't have the font you used on his local machine may not be able to read the document correctly.
Your fifth mistake is that you didn't define an encoding when using the font (this is related to your second mistake, but it's different).
THE SOLUTION:
I have written a small example called CzechExample that results in the following PDF: czech.pdf
I have added the same text twice, but using a different encoding:
public static final String FONT = "resources/fonts/FreeSans.ttf";
public void createPdf(String dest) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(DEST));
document.open();
Font f1 = FontFactory.getFont(FONT, "Cp1250", true);
Paragraph p1 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f1);
document.add(p1);
Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);
Paragraph p2 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f2);
document.add(p2);
document.close();
}
To avoid your third mistake, I used the font FreeSans.ttf instead of Helvetica. You can choose any other font as long as it supports the characters you want to use. To avoid your fourth mistake, I have set the embedded parameter to true.
As for your fifth mistake, I introduced two different approaches.
In the first case, I told iText to use code page 1250.
Font f1 = FontFactory.getFont(FONT, "Cp1250", true);
This will embed the font as a simple font into the PDF, meaning that each character in your String will be represented using a single byte. The advantage of this approach is simplicity; the disadvantage is that you shouldn't start mixing code pages. For instance: this won't work for Cyrillic glyphs.
In the second case, I told iText to use Unicode for horizontal writing:
Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);
This will embed the font as a composite font into the PDF, meaning that each character in your String will be represented using more than one byte. The advantage of this approach is that it is the recommended approach in the newer PDF standards (e.g. PDF/A, PDF/UA), and that you can mix Cyrillic with Latin, Chinese with Japanese, etc... The disadvantage is that you create more bytes, but that effect is limited by the fact that content streams are compressed anyway.
When I decompress the content stream for the text in the sample PDF, I see the following PDF syntax:
As I explained, single bytes are used to store the text of the first line. Double bytes are used to store the text of the second line.
You may be surprised that these characters look OK on the outside (when looking at the text in Adobe Reader), but don't correspond with what you see on the inside (when looking at the second screen shot), but that's how it works.
CONCLUSION:
Many people think that creating PDF is trivial, and that tools for creating PDF should be a commodity. In reality, it's not always that simple ;-)
If you are using the FontProvider, I managed to solve the display of the special characters by setting the registerShippedFreeFonts parameter to true:
FontProvider dfp = new DefaultFontProvider(true, true, false);
See also: https://itextpdf.com/en/resources/books/itext-7-converting-html-pdf-pdfhtml/chapter-6-using-fonts-pdfhtml

How can I change type 3 font using ghostscript?

I have a postscript file which contains Type 3 Font.After converting that postscript to pdf using "gs" command ,I am unable to extract the text from pdf file.Is there any possibility to avoid change Type 3 Fonts to some other font, by substituting or some other way ,so that I can copy the text?
This is another case of miscomprehension regarding type 3 fonts. The fact that a font is a type 3 font has little to do with whether a PostScript program or PDF file using the font is 'searchab;e' or not.
Fonts in PostScript and PDF have an 'Encoding' which maps the character codes 0-255 to a named procedure in the font. Executing that procedure draws the glyph. The character codes can be anything, but are often (for Latin fonts) chosen to match the ASCII encoding.
PDF has the additional concept of a ToUnicode CMap, additional information which maps a character code in a font to a Unicode code point. PostScript has no such analogue, that's not what PostScript is for (its also not what PDF was originally for, which is why ToUnicode CMaps are a later addition to the PDF standard).
In the absence of a ToUnicode CMap Acrobat uses undocumented heuristics to try and guess what the text is. The obvious one (and the only one we know of) is that it treats the character codes as ASCII.
Now, if your original PostScript program has an encoding that maps the character codes as if they were ASCII< then provided you do not subset the font, the resulting PDF file should also contain ASCII character codes. If you do subset the font then the pdfwrite device will reorder the glyphs and the character codes will no longer be ASCII.
If your original PostScirpt file does not order the glyphs in the font using ASCII character codes then there is nothing you can do other than apply OCR, the information simply is not present.
But forget about altering the font type, not only is it not likely to be possible, it isn't the problem.

How to search my PDF with grep?

I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (listings below not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications, which do convert fonts to so-called 'outlines'. The reason for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
In this case, text extraction (and your method of grepping for strings) should work:
Even though the column named uni (telling if a toUnicode map is embedded in the PDF file)
says no for each single font, the encoding column does not contain custom, but builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1.
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!

80-characters / right margin line in Sublime Text 3

You can have 80-characters / right margin line in Netbeans, Text Mate and probably many, many more other IDEs. Is it possible to have it in Sublime Text 3 as well? Any option, plugin etc.?
Yes, it is possible in Sublime Text 2, ST3, and ST4 (which you should really upgrade to if you haven't already). Select View → Ruler → 80 (there are several other options there as well). If you like to actually wrap your text at 80 columns, select View → Word Wrap Column → 80. Make sure that View → Word Wrap is selected.
To make your selections permanent (the default for all opened files or views), open Preferences → Settings and use any of the following rules in the right-side pane:
{
// set vertical rulers in specified columns.
// Use "rulers": [80] for just one ruler
// default value is []
"rulers": [80, 100, 120],
// turn on word wrap for source and text
// default value is "auto", which means off for source and on for text
"word_wrap": true,
// set word wrapping at this column
// default value is 0, meaning wrapping occurs at window width
"wrap_width": 80
}
These settings can also be used in a .sublime-project file to set defaults on a per-project basis, or in a syntax-specific .sublime-settings file if you only want them to apply to files written in a certain language (Python.sublime-settings vs. JavaScript.sublime-settings, for example). Access these settings files by opening a file with the desired syntax, then selecting Preferences → Settings—Syntax Specific.
As always, if you have multiple entries in your settings file, separate them with commas , except for after the last one. The entire content should be enclosed in curly braces { }. Basically, make sure it's valid JSON.
If you'd like a key combo to automatically set the ruler at 80 for a particular view/file, or you are interested in learning how to set the value without using the mouse, please see my answer here.
Finally, as mentioned in another answer, you really should be using a monospace font in order for your code to line up correctly. Other types of fonts have variable-width letters, which means one 80-character line may not appear to be the same length as another 80-character line with different content, and your indentations will look all messed up. Sublime has monospace fonts set by default, but you can of course choose any one you want. Personally, I really like Liberation Mono. It has glyphs to support many different languages and Unicode characters, looks good at a variety of different sizes, and (most importantly for a programming font) clearly differentiates between 0 and O (digit zero and capital letter oh) and 1 and l (digit one and lowercase letter ell), which not all monospace fonts do, unfortunately. Version 2.0 and later of the font are licensed under the open-source SIL Open Font License 1.1 (here is the FAQ).
For this to work, your font also needs to be set to monospace.
If you think about it, lines can't otherwise line up perfectly perfectly.
This answer is detailed at sublime text forum:
http://www.sublimetext.com/forum/viewtopic.php?f=3&p=42052
This answer has links for choosing an appropriate font for your OS,
and gives an answer to an edge case of fonts not lining up.
Another website that lists great monospaced free fonts for programmers.
http://hivelogic.com/articles/top-10-programming-fonts
On stackoverflow, see:
Michael Ruth's answer here:
How to make ruler always be shown in Sublime text 2?
MattDMo's answer here:
What is the default font of Sublime Text?
I have rulers set at the following:
30
50 (git commit message titles should be limited to 50 characters)
72 (git commit message details should be limited to 72 characters)
80 (Windows Command Console Window maxes out at 80 character width)
Other viewing environments that benefit from shorter lines:
github: there is no word wrap when viewing a file online
So, I try to keep .js .md and other files at 70-80 characters.
Windows Console: 80 characters.

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x. I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about, Acrobat does wonders in rendering faulty PDF files hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I trief to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (See stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The char is referenced by char #1. Given the Encoding of the font it contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now in the embedded subset font there is a cmap that matches no glyph at character 65 (eg capital A) the cmap section of the font does define the character in exactly the order in the PDF file Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs..
Solution
In the ISO32000 file there is a remark that symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the encoding is not allowed and you should IGNORE it, using a simple 1on1 encoding always. SO all in all, if it is a symbolic font, I ignore the Encoding object altogether and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so its almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.