Remove or Overwrite Text on PDF in a specific position - pdf

I'm looking for a solution to my problem.
I'm trying to modify text that appears at a specific position in a PDF document using iTextSharp.
Could someone please help?
SOLUTION:
I resolved it by writing this:
public bool StampOnPDF(string _PathPDF, string _text, string _Total)
{
string _fileName = Path.GetFileName(_PathPDF);
string oldFile = _PathPDF;
string BackupPDF = _PathPDF.Replace(".pdf", "_old.pdf");
File.Copy(oldFile, BackupPDF);
iTextSharp.text.Rectangle Zone1 = new iTextSharp.text.Rectangle(495, 157, 540, 148);
iTextSharp.text.Rectangle Zone2 = new iTextSharp.text.Rectangle(495, 130, 540, 105);
using (PdfReader reader = new PdfReader(BackupPDF))
using (PdfStamper stamper = new PdfStamper(reader, new FileStream(oldFile, FileMode.Create)))
{
PdfContentByte pbover = stamper.GetOverContent(1);
Zone1.BackgroundColor = BaseColor.WHITE;
pbover.Rectangle(Zone1);
Zone2.BackgroundColor = BaseColor.WHITE;
pbover.Rectangle(Zone2);
// select the font properties
var normalFont = FontFactory.GetFont(FontFactory.HELVETICA, 12);
var boldFont = FontFactory.GetFont(FontFactory.HELVETICA_BOLD, 12);
normalFont.Size = 8;
boldFont.Size = 8;
string text = _text; // was _testo, which is not declared; the parameter is named _text
ColumnText.ShowTextAligned(pbover, Element.ALIGN_CENTER, new Phrase(text, normalFont), 300, 180, 0);
text = _Total;
ColumnText.ShowTextAligned(pbover, Element.ALIGN_CENTER, new Phrase(text, boldFont), 523, 115, 0);
ColumnText.ShowTextAligned(pbover, Element.ALIGN_CENTER, new Phrase(text, normalFont), 523, 150, 0);
}
return true;
}

This problem is not a trivial one.
To understand why, let's look at a piece of a PDF document.
[a, -28.7356, p, 27.2652, p, 27.2652, e, -27.2652, a, -28.7356, r, 64.6889, a, -28.7356, n, 27.2652, c, -38.7594, e, 444] TJ
/R10 10.44 Tf
68.16 0.24 Td
[", 17.1965, P, -18.7118, i, -9.35592, l, -9.35592, o, -17.2414, t, -9.35636, ", 17.1965, , 250] TJ
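The first TJ instruction in this piece of code tells the viewer to render the word "appearance"; the second one renders "Pilot" in quotes.
What you see here is each individual letter being rendered.
Each array alternates letters and kerning adjustments and is followed by the TJ operator (the text-showing instruction).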
This should give you an idea of how hard it would be to replace a piece of text by something else.
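To make the letter-by-letter structure concrete, here is a minimal Java sketch that recovers the word from the first array above. It assumes the comma-separated notation used in this post (in an actual content stream the letters are PDF string objects, e.g. `(a)`), and the class and method names are made up for illustration.

```java
final class TjArrayDemo {
    /**
     * Returns the letters of a TJ array written as "[a, -28.7, p, ...]",
     * skipping the numeric kerning adjustments.
     * Limitation of this sketch: a digit that is part of the text would be
     * mistaken for a kerning value.
     */
    static String extractText(String tjArray) {
        String body = tjArray.substring(tjArray.indexOf('[') + 1, tjArray.lastIndexOf(']'));
        StringBuilder text = new StringBuilder();
        for (String token : body.split(",")) {
            String t = token.trim();
            if (t.isEmpty() || isNumber(t)) {
                continue; // kerning adjustment or empty entry, not a letter
            }
            text.append(t);
        }
        return text.toString();
    }

    static boolean isNumber(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String tj = "[a, -28.7356, p, 27.2652, p, 27.2652, e, -27.2652, a, -28.7356, "
                + "r, 64.6889, a, -28.7356, n, 27.2652, c, -38.7594, e, 444]";
        System.out.println(extractText(tj));
    }
}
```

Note that the word only exists as a sequence of separately positioned glyphs; nothing in the array says where the word, line, or paragraph ends.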
If you make an existing word shorter, you would need to move all other letters again. This problem is known as "reflowing" text. Reflowing is not something that can be trivially done with PDF documents. To achieve reflow, you need high-level information (such as which words belong to which paragraphs). This level of information is generally not present in a PDF document.
As @mkl indicated, if you simply want to remove the text (perhaps covering it with a black box to indicate it was removed), iText can certainly help you.
If you want to overwrite the text, that's (generally) not possible. It can be done if the word you're replacing it with has the same letters and you don't care that much about layout, since a word like "iText" might not take up the same amount of space as "Jazzy".

Related

When using iText to generate a PDF, if I need to switch fonts many times the file size becomes too large

I have a section of my PDF in which I need to use one font for its unicode symbol and the rest of the paragraph should be a different font. (It is something like "1. a 2. b 3. c" where "1." is the unicode symbol/font and "a" is another font) I have followed the method Bruno describes here: iText 7: How to build a paragraph mixing different fonts? and it works fine to generate the PDF. The issue is that the file size of the PDF goes from around 20MB to around 100MB compared to using only one font and one Text element. This section is used repeatedly in the document thousands of times. I am wondering if there is a way to reduce the impact of switching fonts or to reduce the file size of the entire document in some way.
Style creation pseudocode:
Style style1 = new Style();
Style style2 = new Style();
PdfFont font1 = PdfFontFactory.createFont(FontProgramFactory.createFont(fontFile1), PdfEncodings.IDENTITY_H, true);
style1.setFont(font1).setFontSize(8f).setFontColor(Color.DARK_GRAY);
PdfFont font2 = PdfFontFactory.createFont(FontProgramFactory.createFont(fontFile2), "", false);
style2.setFont(font2).setFontSize(8f).setFontColor(Color.DARK_GRAY);
Writing text/paragraph pseudocode:
Div div = new Div().setPaddingLeft(3).setMarginBottom(0).setKeepTogether(true);
Paragraph paragraph = new Paragraph();
loop up to 25 times: {
Text unicodeText = new Text(unicodeSymbol + " ").addStyle(style1);
paragraph.add(unicodeText);
Text plainText = new Text(plainText + " ").addStyle(style2);
paragraph.add(plainText);
}
div.add(paragraph);
This writing of text/paragraph is done thousands of times and makes up most of the document. Basically the document consists of thousands of "buildings" that have corresponding codes and the codes have categories. I need to have the index for the category as the unicode symbol and then all of the corresponding codes within the paragraph for the building.
Here is reproducible code:
float offSet = 50;
Integer leading = 10;
DateFormat format = new SimpleDateFormat("yyyy_MM_dd_kkmmss");
String formattedDate = format.format(new Date());
String path = "/tmp/testing_pdf_"+formattedDate + ".pdf";
File targetPdfFile = new File(path);
PdfWriter writer = new PdfWriter(path, new WriterProperties().addXmpMetadata());
PdfDocument pdf = new PdfDocument(writer);
pdf.setTagged();
PageSize pageSize = PageSize.LETTER;
Document document = new Document(pdf, pageSize);
document.setMargins(offSet, offSet, offSet, offSet);
byte[] font1file = IOUtils.toByteArray(FileUtility.getInputStreamFromClassPath("fonts/Garamond-Premier-Pro-Regular.ttf"));
byte[] font2file = IOUtils.toByteArray(FileUtility.getInputStreamFromClassPath("fonts/Quivira.otf"));
PdfFont font1 = PdfFontFactory.createFont(FontProgramFactory.createFont(font1file), "", true);
PdfFont font2 = PdfFontFactory.createFont(FontProgramFactory.createFont(font2file), PdfEncodings.IDENTITY_H, true);
Style style1 = new Style().setFont(font1).setFontSize(8f).setFontColor(Color.DARK_GRAY);
Style style2 = new Style().setFont(font2).setFontSize(8f).setFontColor(Color.DARK_GRAY);
float columnGap = 5;
float columnWidth = (pageSize.getWidth() - offSet * 2 - columnGap * 2) / 3;
float columnHeight = pageSize.getHeight() - offSet * 2;
Rectangle[] columns = {
new Rectangle(offSet, offSet, columnWidth, columnHeight),
new Rectangle(offSet + columnWidth + columnGap, offSet, columnWidth, columnHeight),
new Rectangle(offSet + columnWidth * 2 + columnGap * 2, offSet, columnWidth, columnHeight)};
document.setRenderer(new ColumnDocumentRenderer(document, columns));
for (int j = 0; j < 5000; j++) {
Div div = new Div().setPaddingLeft(3).setMarginBottom(0).setKeepTogether(true);
Paragraph paragraph = new Paragraph().setFixedLeading(leading);
// StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < 26; i++) {
paragraph.add(new Text("\u3255 ").addStyle(style2));
paragraph.add(new Text("test ").addStyle(style1));
// stringBuilder.append("\u3255 ").append(" test ");
}
// paragraph.add(stringBuilder.toString()).addStyle(style2);
div.add(paragraph);
document.add(div);
}
document.close();
In creating the reproducible code I have found that this is related to the document being tagged. If you remove the line that marks it as tagged, the file size is reduced greatly.
You can also reduce the file size by using the commented-out string builder with one font instead of two. (Comment out the two "paragraph.add"s in the for-loop.) This mirrors the issue I have in my code.
The problem is not in the fonts themselves. The issue comes from the fact that you are creating a tagged PDF. Tagged documents contain a lot of PDF objects that need a lot of space in the file.
I wasn't able to reproduce your 20MB vs 100MB results. On my machine whether with one font or with two fonts, but with two Text elements, the resultant file size is ~44MB.
To compress the file when creating large tagged documents, you should use full compression mode, which compresses all PDF objects, not only streams.
To activate full compression mode, create a PdfWriter instance with WriterProperties:
PdfWriter writer = new PdfWriter(outFileName,
new WriterProperties().setFullCompressionMode(true));
This setting reduced the file size for me from >40MB to ~5MB.
Please note that you are using iText 7.0.x while 7.1.x line has already been released and is now the main line of iText, so I recommend that you update to the latest version.

PDFs generated with iTextSharp generated watermark giving error

We are applying a watermark using iTextSharp to PDF documents before passing them to client. On some machines (all using v.11 of PDF viewer), the following error is being displayed.
An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF Document to correct the problem.
The watermarking code is as follows:
protected static byte[] GetStampedDocument(byte[] content, string mark, string heading)
{
PdfReader reader = new PdfReader(content);
using (MemoryStream stream = new MemoryStream())
{
PdfStamper pdfStamper = new PdfStamper(reader, stream);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
iTextSharp.text.Rectangle pageSize = reader.GetPageSizeWithRotation(i);
PdfContentByte pdfPageContents = pdfStamper.GetOverContent(i);
pdfPageContents.BeginText();
PdfGState gstate = new PdfGState();
gstate.FillOpacity = 0.2f;
gstate.StrokeOpacity = 0.3f;
pdfPageContents.SaveState();
pdfPageContents.SetGState(gstate);
BaseFont baseFont = BaseFont.CreateFont(BaseFont.HELVETICA_BOLD, Encoding.ASCII.EncodingName, false);
pdfPageContents.SetFontAndSize(baseFont, 46);
pdfPageContents.SetRGBColorFill(32, 32, 32);
pdfPageContents.ShowTextAligned(PdfContentByte.ALIGN_CENTER, mark, pageSize.Width / 2, pageSize.Height / 2, 66);
if (heading != null && heading.Length > 0)
{
pdfPageContents.SetFontAndSize(baseFont, 12);
pdfPageContents.SetRGBColorFill(32, 32, 32);
pdfPageContents.ShowTextAligned(PdfContentByte.ALIGN_LEFT, heading, 5, pageSize.Height - 15, 0);
}
pdfPageContents.EndText();
pdfPageContents.RestoreState();
}
pdfStamper.FormFlattening = true;
pdfStamper.FreeTextFlattening = true;
pdfStamper.Close();
return stream.ToArray();
}
}
I cannot recreate this on any machine I have tried so there is an environmental element to this as well I expect.
Any ideas?
You save the graphics state inside a text object:
pdfPageContents.BeginText();
[...]
pdfPageContents.SaveState();
[...]
pdfPageContents.EndText();
pdfPageContents.RestoreState();
This is not allowed; cf. Figure 9, "Graphics objects", in ISO 32000-2: special graphics state operators (like saving or restoring the graphics state) may not be used inside a text object.
To prevent this invalid syntax, move pdfPageContents.SaveState() before pdfPageContents.BeginText(). This furthermore makes the nesting of saving/restoring the state and beginning and ending the text object more natural.
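The rule can be checked mechanically on the operator sequence a page ends up with. The following is a minimal Java sketch, not part of iTextSharp: it assumes the content stream has already been reduced to a list of operator names (BT/ET for beginning and ending a text object, q/Q for saving and restoring the graphics state), and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class TextObjectNestingCheck {
    /** Returns the index of the first q or Q found inside a BT..ET text object, or -1 if there is none. */
    static int firstViolation(List<String> operators) {
        boolean inTextObject = false;
        for (int i = 0; i < operators.size(); i++) {
            String op = operators.get(i);
            if (op.equals("BT")) {
                inTextObject = true;
            } else if (op.equals("ET")) {
                inTextObject = false;
            } else if (inTextObject && (op.equals("q") || op.equals("Q"))) {
                return i; // graphics state saved or restored inside a text object
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Operator order produced by the code in the question: BeginText() comes before SaveState()
        List<String> asInQuestion = Arrays.asList("BT", "q", "gs", "Tf", "rg", "Tj", "ET", "Q");
        // Order after moving SaveState() in front of BeginText()
        List<String> corrected = Arrays.asList("q", "BT", "gs", "Tf", "rg", "Tj", "ET", "Q");
        System.out.println(firstViolation(asInQuestion));
        System.out.println(firstViolation(corrected));
    }
}
```

The first sequence is flagged at the q operator, while the corrected sequence passes. Some viewers tolerate the invalid nesting and others do not, which would be consistent with the error appearing only on some machines.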

Unable to add margins in iTextSharp document having images

Requirement:
A large image (dynamic) needs to be split and shown in PDF pages. If the image can't be accommodated in one page, then we need to add another page and try to fit the remaining portion, and so on.
So far I am able to split the image in multiple pages, however it appears that they are completely ignoring the margin values and so images are shown without any margins.
Please see below code:
string fileStringReplace = imageByteArray.Replace("data:image/jpeg;base64,", "");
Byte[] imageByte = Convert.FromBase64String(fileStringReplace);
iTextSharp.text.Image image = iTextSharp.text.Image.GetInstance(imageByte);
float w = image.ScaledWidth;
float h = image.ScaledHeight;
float cropHeight = 1500f;
iTextSharp.text.Rectangle page = new iTextSharp.text.Rectangle(1150f, cropHeight);
var x = page.Height;
Byte[] created;
iTextSharp.text.Document document = new iTextSharp.text.Document(page, 20f, 20f, 20f, 40f); // This has no impact
using (var outputMemoryStream = new MemoryStream())
{
PdfWriter writer = PdfWriter.GetInstance(document, outputMemoryStream);
writer.CloseStream = false;
document.Open();
PdfContentByte canvas = writer.DirectContentUnder;
float usedHeights = h;
while (usedHeights >= 0)
{
usedHeights -= cropHeight;
document.SetPageSize(new iTextSharp.text.Rectangle(1150f, cropHeight));
canvas.AddImage(image, w, 0, 0, h, 0, -usedHeights);
document.NewPage();
}
document.Close();
created = outputMemoryStream.ToArray();
outputMemoryStream.Write(created, 0, created.Length);
outputMemoryStream.Position = 0;
}
return created;
I also tried to set margin in the loop by document.SetMargins() - but that's not working.
You are mixing different things.
When you create margins, be it while constructing the Document instance or by using the SetMargins() method, you create margins for when you let iText(Sharp) decide on the layout. That is: the margins will be respected when you do something like document.Add(image).
However, you do not allow iText to create the layout. You create a PdfContentByte named canvas and you decide to add the image to that canvas using a transformation matrix. This means that you will calculate the a, b, c, d, e, and f value needed for the AddImage() method.
You are supposed to do that math. If you want to see a margin, then the values w, 0, 0, h, 0, and -usedHeights are wrong, and you shouldn't blame iTextSharp, you should blame your lack of insight in analytic geometry (that's the stuff you learn in high school at the age of 16).
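As a rough illustration of that math, here is a small Java sketch that computes the six matrix values for an image scaled to fit inside the margin box of a page. It relies on the PDF convention that an image occupies the unit square, so a and d are the rendered width and height and e and f are the offset of its lower-left corner. The class and method names are hypothetical, and the sketch does not deal with slicing one tall image across several pages.

```java
final class ImageMatrixDemo {
    /**
     * Computes {a, b, c, d, e, f} so that an image of imgW x imgH is scaled
     * uniformly to fit the area left over by the margins and is placed at the
     * lower-left corner of that area.
     */
    static float[] fitInsideMargins(float pageW, float pageH, float imgW, float imgH,
                                    float left, float right, float top, float bottom) {
        float availableW = pageW - left - right;
        float availableH = pageH - top - bottom;
        // Uniform scale factor that keeps the aspect ratio
        float scale = Math.min(availableW / imgW, availableH / imgH);
        float a = imgW * scale; // rendered width
        float d = imgH * scale; // rendered height
        // b and c stay 0 because there is no rotation or skew
        return new float[] {a, 0f, 0f, d, left, bottom};
    }

    public static void main(String[] args) {
        // Page size and margins (left, right, top, bottom) taken from the question;
        // the image size is an arbitrary example value
        float[] m = fitInsideMargins(1150f, 1500f, 2220f, 1000f, 20f, 20f, 20f, 40f);
        System.out.println(java.util.Arrays.toString(m));
    }
}
```

The six values would then be passed as the a, b, c, d, e, f arguments of AddImage() in place of w, 0, 0, h, 0, -usedHeights.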
This might be easier for you:
iTextSharp.text.Image image = iTextSharp.text.Image.GetInstance(imageByte);
float w = image.ScaledWidth;
float h = image.ScaledHeight;
// For the sake of simplicity, I don't crop the image, I just add 20 user units
iTextSharp.text.Rectangle page = new iTextSharp.text.Rectangle(w + 20, h + 20);
iTextSharp.text.Document document = new iTextSharp.text.Document(page);
PdfWriter writer = PdfWriter.GetInstance(document, outputMemoryStream);
// Please drop the line that prevents closing the output stream!
// Why are so many people making this mistake?
// Who told you you shouldn't close the output stream???
document.Open();
// We define an absolute position for the image
// it will leave a margin of 10 to the left and to the bottom
// as we created a page that is 20 user units too wide and too high,
// we will also have a margin of 10 to the right and to the top
image.SetAbsolutePosition(10, 10);
document.Add(image);
document.Close();
Note that SetAbsolutePosition() also lets you take control, regardless of the margins. As an alternative, you could use:
iTextSharp.text.Image image = iTextSharp.text.Image.GetInstance(imageByte);
float w = image.ScaledWidth;
float h = image.ScaledHeight;
// For the sake of simplicity, I don't crop the image, I just add 20 user units
iTextSharp.text.Rectangle page = new iTextSharp.text.Rectangle(w + 20, h + 20);
iTextSharp.text.Document document = new iTextSharp.text.Document(page, 10, 10, 10, 10);
PdfWriter writer = PdfWriter.GetInstance(document, outputMemoryStream);
// Please drop the line that prevents closing the output stream!
// Why are so many people making this mistake?
// Who told you you shouldn't close the output stream???
document.Open();
// We add the image to the document, and we let iTextSharp decide where to put it
// As there is just sufficient space to fit the image inside the page, it should fit,
// But be aware of the existence of a leading; that could create side-effects
// such as forwarding the image to the next page because it doesn't fit vertically
document.Add(image);
document.Close();

Extracting strings of RightToLeft languages with iTextSharp from a pdf file

I searched for a way to extract strings of RightToLeft languages with iTextSharp, but I could not find one. Is it possible to extract strings of RightToLeft languages from a pdf file with iTextSharp?
With thanks
EDIT:
This Code has very good result:
private void writePdf2()
{
using (var document = new Document(PageSize.A4))
{
var writer = PdfWriter.GetInstance(document, new FileStream(@"C:\Users\USER\Desktop\Test2.pdf", FileMode.Create));
document.Open();
FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
var tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);
var reader = new PdfReader(@"C:\Users\USER\Desktop\Test.pdf");
int intPageNum = reader.NumberOfPages;
string text = null;
for (int i = 1; i <= intPageNum; i++)
{
text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));
text = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(text);
File.WriteAllText("page-" + i + "-text.txt", text.ToString());
}
reader.Close();
ColumnText.ShowTextAligned(
canvas: writer.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
//phrase: new Phrase(new Chunk(text, tahoma)),
x: 300,
y: 300,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
System.Diagnostics.Process.Start(@"C:\Users\USER\Desktop\Test2.pdf");
}
But "phrase: new Phrase(new Chunk(text, tahoma))" does not have correct output for all strings in the PDF. Therefore I used "PdfStamper" to make a PDF which is suitable for "PdfReader" in "iTextSharp".
Reproducing the issue
As initially the OP couldn't provide a sample file, I first tried to reproduce the issue with a file generated by iTextSharp itself.
My test method first creates a PDF using the ColumnText.ShowTextAligned with the string constant which according to the OP returns a good result. Then it extracts the text content of that file. Finally it creates a second PDF containing a line created using the good ColumnText.ShowTextAligned call with the string constant and then several lines created using ColumnText.ShowTextAligned with the extracted string with or without the post-processing instructions from the OP's code (UTF8-encoding and -decoding; applying UnicodeCharacterPlacement) performed.
I could not immediately find the UnicodeCharacterPlacement class the OP uses. So I googled a bit and found one such class here. I hope this is essentially the class used by the OP.
public void ExtractTextLikeUser2509093()
{
string rtlGood = @"C:\Temp\test-results\extract\rtlGood.pdf";
string rtlGoodExtract = @"C:\Temp\test-results\extract\rtlGood.txt";
string rtlFinal = @"C:\Temp\test-results\extract\rtlFinal.pdf";
Directory.CreateDirectory(@"C:\Temp\test-results\extract\");
FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
Font tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);
// A - Create a PDF with a good RTL representation
using (FileStream fs = new FileStream(rtlGood, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document document = new Document())
{
PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
document.Open();
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
x: 500,
y: 300,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
}
// B - Extract the text for that good representation and add it to a new PDF
String textA, textB, textC, textD;
using (PdfReader pdfReader = new PdfReader(rtlGood))
{
textA = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new LocationTextExtractionStrategy());
textB = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(textA.ToString()));
textC = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(textA);
textD = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(textB);
File.WriteAllText(rtlGoodExtract, textA + "\n\n" + textB + "\n\n" + textC + "\n\n" + textD + "\n\n");
}
using (FileStream fs = new FileStream(rtlFinal, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document document = new Document())
{
PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
document.Open();
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
x: 500,
y: 600,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textA, tahoma)),
x: 500,
y: 550,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textB, tahoma)),
x: 500,
y: 500,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textC, tahoma)),
x: 500,
y: 450,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textD, tahoma)),
x: 500,
y: 400,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
}
}
The final result:
Thus,
I cannot reproduce the issue. Both the final two variants to me look identical in their Arabic contents with the original line. In particular I could not observe the switch from "سلام" to "سالم". Most likely content of the PDF C:\Users\USER\Desktop\Test.pdf (from which the OP extracted the text in his test) is somehow peculiar and so text extracted from it draws with that switch.
Applying that UnicodeCharacterPlacement class to the extracted text is necessary to get it into the right order.
The other post-processing line,
text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));
does not make any difference and should not be used.
For further analysis we would need that PDF C:\Users\USER\Desktop\Test.pdf.
Inspecting salamword.pdf
Eventually the OP could provide a sample PDF, salamword.pdf:
I used "PrimoPDF" to create a PDF file with this content: "Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم".
Next I read this PDF file. Then I received this output: "Test. Hello world. Hello people. م . م . م دم ".
Indeed I could reproduce this behavior. So I analyzed the way the Arabic writing was encoded inside...
Some background information to start with:
Fonts in PDFs can have (and in the case at hand do have) a completely custom encoding. In particular embedded subsets often are generated by choosing codes as the characters come, e.g. the first character from a given font used on a page is encoded as 1, the second different as 2, the third different as 3 etc.
Thus, simply extracting the codes of the drawn text does not help very much at all (see below for an example from the file at hand). But a font inside a PDF can bring along some extra information allowing an extractor to map the codes to Unicode values. This information might be
a ToUnicode map providing an immediate map code -> Unicode code point;
an Encoding providing a base encoding (e.g. WinAnsiEncoding) and differences from it in the form of glyph names; these names may be standard names or names only meaningful in the context of the font at hand;
ActualText entries for a structure element or marked-content sequence.
The PDF specification describes a method using the ToUnicode and the Encoding information with standard names to extract text from a PDF and presents ActualText as an alternative way where applicable. The iTextSharp text extraction code implements the ToUnicode / Encoding method with standard names.
Standard names in this context in the PDF specification are character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font.
In the file at hand:
Let's look at the Arabic text in the line written in Arial. The codes used for the glyphs here are:
01 02 03 04 05 01 02 06 07 01 08 02 06 07 01 09 05 0A 0B 01 08 02 06 07
This looks very much like an ad-hoc encoding as described above. Thus, using only this information does not help at all.
Thus, let's look at the ToUnicode mapping of the embedded Arial subset:
<01><01><0020>
<02><02><0645>
<03><03><062f>
<04><04><0631>
<08><08><002e>
<0c><0c><0028>
<0d><0d><0077>
<0e><0e><0069>
<0f><0f><0074>
<10><10><0068>
<11><11><0041>
<12><12><0072>
<13><13><0061>
<14><14><006c>
<15><15><0066>
<16><16><006f>
<17><17><006e>
<18><18><0029>
This maps 01 to 0020, 02 to 0645, 03 to 062f, 04 to 0631, 08 to 002e, etc. It does not map 05, 06, 07, etc to anything, though.
So the ToUnicode map only helps for some codes.
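The effect of such a partial map can be reproduced with a few lines of Java. This is only a sketch under the assumption that the lines quoted above are range entries of the form `<first code><last code><first Unicode value>`; it is not how iTextSharp implements the lookup, and the class name is made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ToUnicodeDemo {
    private static final Pattern RANGE =
            Pattern.compile("<([0-9A-Fa-f]+)><([0-9A-Fa-f]+)><([0-9A-Fa-f]+)>");

    /** Parses entries like "<02><02><0645>" into a map from character code to Unicode code point. */
    static Map<Integer, Integer> parseRanges(String entries) {
        Map<Integer, Integer> map = new HashMap<>();
        Matcher m = RANGE.matcher(entries);
        while (m.find()) {
            int first = Integer.parseInt(m.group(1), 16);
            int last = Integer.parseInt(m.group(2), 16);
            int target = Integer.parseInt(m.group(3), 16);
            for (int code = first; code <= last; code++) {
                map.put(code, target + (code - first));
            }
        }
        return map;
    }

    /** Maps each code to its Unicode character; unmapped codes become U+FFFD. */
    static String decode(int[] codes, Map<Integer, Integer> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            Integer codePoint = toUnicode.get(code);
            out.appendCodePoint(codePoint != null ? codePoint : 0xFFFD);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = parseRanges(
                "<01><01><0020>\n<02><02><0645>\n<03><03><062f>\n<04><04><0631>\n<08><08><002e>");
        // The first eight codes of the Arabic line quoted above
        int[] codes = {0x01, 0x02, 0x03, 0x04, 0x05, 0x01, 0x02, 0x06};
        System.out.println(decode(codes, map));
    }
}
```

Codes 05 and 06 come out as replacement characters because the map has no entry for them, which is the gap an extractor then has to fill from some other source.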
Now let's look at the associated encoding
29 0 obj
<</Type/Encoding
/BaseEncoding/WinAnsiEncoding
/Differences[ 1
/space/uni0645/uni062F/uni0631
/uni0645.init/uni06440627.fina/uni0633.init/period
/uni0647.fina/uni0644.medi/uni06A9.init/parenleft
/w/i/t/h
/A/r/a/l
/f/o/n/parenright ]
>>
endobj
The encoding is based on WinAnsiEncoding but all codes of interest are remapped in the Differences. There we find numerous standard glyph names (i.e. character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font) like space, period, w, i, t, etc.; but we also find several non-standard names like uni0645, uni06440627.fina etc.
There appears to be a scheme used for these names, uni0645 represents the character at Unicode code point 0645, and uni06440627.fina most likely represents the characters at Unicode code point 0644 and 0627 in some order in some final form. But still these names are non-standard for the purpose of text extraction according to the method presented by the PDF specification.
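An extractor that wanted to go beyond the standard method could try to exploit that scheme. The following Java sketch assumes the scheme is exactly as guessed above ("uni" followed by groups of four hexadecimal digits, with anything after the first period being a form suffix); it is an illustration only, not something the iTextSharp extraction code is stated to do, and the class name is hypothetical.

```java
final class UniGlyphNameDemo {
    /**
     * Decodes names such as "uni0645" or "uni06440627.fina" into the characters
     * they appear to stand for. Returns null for names that do not follow the
     * assumed scheme, e.g. "period".
     */
    static String decode(String glyphName) {
        int dot = glyphName.indexOf('.');
        String base = dot >= 0 ? glyphName.substring(0, dot) : glyphName; // drop ".init", ".fina", ...
        if (!base.startsWith("uni")) {
            return null;
        }
        String hex = base.substring(3);
        if (!hex.matches("([0-9A-Fa-f]{4})+")) {
            return null;
        }
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < hex.length(); i += 4) {
            text.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"uni0645", "uni06440627.fina", "uni0633.init", "period"}) {
            String decoded = decode(name);
            System.out.println(name + " -> " + (decoded == null ? "(not decodable)" : decoded));
        }
    }
}
```

Whether the two characters of a ligature name like uni06440627.fina are listed in logical order is itself a guess, so the output of such a heuristic would still need checking against the rendered page.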
Furthermore, there are no ActualText entries in the file at all.
So the reason why only " م . م . م دم " is extracted is that only for these glyphs does the PDF contain proper information for the standard PDF text extraction method.
By the way, if you copy&paste from your file in Adobe Reader you'll get a similar result, and Adobe Reader has a fairly good implementation of the standard text extraction method.
TL;DR
The sample file simply does not contain the information required for text extraction with the method described by the PDF specification, which is the method implemented by iTextSharp.

PDF Watermark for printing only, programmatically

I can watermark any PDF already, and the images inside, everything ok, but now I need the watermark only showing up when the PDF is printed... Is this possible? How?
I need to do this programmatically of course.
For future readers, this is possible to do by wrapping the watermark in a PDF layer (Optional Content Group), then configuring the Usage attribute of this layer as Print-Only. See the PDF Reference Document, Chapter 4-Graphics, part 4.10-Optional Content for more details.
Using iTextSharp, I was able to get it working with the following code. The important parts are PDF version 1.7 and SetPrint("Watermark", true):
string oldfile = @"c:\temp\oldfile.pdf";
string newFile = @"c:\temp\newfile.pdf";
PdfReader pdfReaderS = new PdfReader(oldfile);
Document document = new Document(pdfReaderS.GetPageSizeWithRotation(1));
PdfWriter pdfWriterD = PdfWriter.GetInstance(document, new FileStream(newFile, FileMode.Create, FileAccess.Write));
pdfWriterD.SetPdfVersion(PdfWriter.PDF_VERSION_1_7);
document.Open();
PdfContentByte pdfContentByteD = pdfWriterD.DirectContent;
BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
int n = pdfReaderS.NumberOfPages;
string text = "UNCONTROLLED";
for (int i = 1; i <= n; i++)
{
iTextSharp.text.Rectangle pageSizeS = pdfReaderS.GetPageSizeWithRotation(i);
float pageWidth = pageSizeS.Width / 2;
float pageheight = pageSizeS.Height / 2;
document.SetPageSize(pageSizeS);
document.NewPage();
PdfImportedPage pdfImportedPage = pdfWriterD.GetImportedPage(pdfReaderS, i);
PdfLayer layer1 = new PdfLayer("Watermark", pdfWriterD);
layer1.SetPrint("Watermark", true);
layer1.View = false;
layer1.On = false;
layer1.OnPanel = false;
pdfContentByteD.BeginLayer(layer1);
pdfContentByteD.SetColorFill(BaseColor.RED);
pdfContentByteD.SetFontAndSize(bf, 30);
ColumnText.ShowTextAligned(pdfContentByteD, Element.ALIGN_CENTER, new Phrase(text), 300, 700, 0);
pdfContentByteD.EndLayer();
pdfContentByteD.AddTemplate(pdfImportedPage, 0, 0);//, 0, 1, 0, 0);
}
document.Close();
pdfReaderS.Close();
You should probably make use of the fact that the screen uses RGB and the printer CMYK. You should be able to create two colors in CMYK that map to the same RGB value. This is of course not enough against a determined specialist.
The bOnScreen parameter determines whether the watermark will be displayed when the PDF is viewed on the computer screen, and bOnPrint determines whether it will be displayed when the PDF is printed.
-- https://acrobatusers.com/tutorials/watermarking-a-pdf-with-javascript