I'm looking for a solution to the following problem.
I'm trying to modify text that appears at a specific position in a PDF document using iTextSharp.
Could someone please help?
SOLUTION:
I solved it with the following code:
public bool StampOnPDF(string _PathPDF, string _text, string _Total)
{
string _fileName = Path.GetFileName(_PathPDF);
string oldFile = _PathPDF;
string BackupPDF = _PathPDF.Replace(".pdf", "_old.pdf");
File.Copy(oldFile, BackupPDF);
iTextSharp.text.Rectangle Zone1 = new iTextSharp.text.Rectangle(495, 157, 540, 148);
iTextSharp.text.Rectangle Zone2 = new iTextSharp.text.Rectangle(495, 130, 540, 105);
using (PdfReader reader = new PdfReader(BackupPDF))
using (PdfStamper stamper = new PdfStamper(reader, new FileStream(oldFile, FileMode.Create)))
{
PdfContentByte pbover = stamper.GetOverContent(1);
Zone1.BackgroundColor = BaseColor.WHITE;
pbover.Rectangle(Zone1);
Zone2.BackgroundColor = BaseColor.WHITE;
pbover.Rectangle(Zone2);
// select the font properties
var normalFont = FontFactory.GetFont(FontFactory.HELVETICA, 12);
var boldFont = FontFactory.GetFont(FontFactory.HELVETICA_BOLD, 12);
normalFont.Size = 8;
boldFont.Size = 8;
string text = _text; // was _testo, which is not a parameter of this method
ColumnText.ShowTextAligned(pbover, Element.ALIGN_CENTER, new Phrase(text, normalFont), 300, 180, 0);
text = _Total;
ColumnText.ShowTextAligned(pbover, Element.ALIGN_CENTER, new Phrase(text, boldFont), 523, 115, 0);
ColumnText.ShowTextAligned(pbover, Element.ALIGN_CENTER, new Phrase(text, normalFont), 523, 150, 0);
}
return true;
}
This problem is not a trivial one.
To understand why, let's look at a piece of a PDF document.
[a, -28.7356, p, 27.2652, p, 27.2652, e, -27.2652, a, -28.7356, r, 64.6889, a, -28.7356, n, 27.2652, c, -38.7594, e, 444] TJ
/R10 10.44 Tf
68.16 0.24 Td
[", 17.1965, P, -18.7118, i, -9.35592, l, -9.35592, o, -17.2414, t, -9.35636, ", 17.1965, , 250] TJ
This piece of code tells the viewer to render the word "appearance".
What you see here is each individual letter being rendered.
The syntax is an array of entries, each a <letter> followed by its <kerning information>, closed by TJ (the text render instruction).
This should give you an idea of how hard it would be to replace a piece of text by something else.
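To make the structure of such a TJ array concrete, here is a minimal sketch in plain Java (no PDF library involved). It assumes the array has already been tokenized into strings and numbers; the class name TjArraySketch and its methods are hypothetical and only serve as illustration. It concatenates the letters and sums the kerning adjustments of the first array shown above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: models a TJ array as a list of strings (glyph runs)
// and numbers (kerning adjustments, in thousandths of a text space unit).
final class TjArraySketch {

    // The first TJ array from the snippet above, already tokenized.
    static final List<Object> FIRST_TJ = Arrays.<Object>asList(
            "a", -28.7356, "p", 27.2652, "p", 27.2652, "e", -27.2652,
            "a", -28.7356, "r", 64.6889, "a", -28.7356, "n", 27.2652,
            "c", -38.7594, "e", 444.0);

    // Concatenates the string elements and skips the numeric adjustments.
    static String shownText(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    // Sums the numeric adjustments between the glyphs.
    static double totalAdjustment(List<Object> tjArray) {
        double sum = 0;
        for (Object element : tjArray) {
            if (element instanceof Number) {
                sum += ((Number) element).doubleValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(shownText(FIRST_TJ));       // the word drawn by the array
        System.out.println(totalAdjustment(FIRST_TJ)); // combined kerning adjustment
    }
}
```

Replacing the word means rewriting every letter and every adjustment in the array, and then shifting whatever follows on the line, which is exactly the reflow problem.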
If you make an existing word shorter, you would need to move all other letters again. This problem is known as "reflowing" text. Reflowing is not something that can be trivially done with PDF documents. To achieve reflow, you need high-level information (such as which words belong to which paragraphs). This level of information is generally not present in a PDF document.
As @mkl indicated, if you simply want to remove the text (perhaps covering it with a black box to indicate it was removed), iText can certainly help you.
If you want to overwrite the text, that's (generally) not possible. It can be done if the word you're replacing it with has the same letters and you don't care that much about layout. (Since a word like "iText" might not take up the same amount of space as "Jazzy").
I am trying to generate a barcode with the barcode4j library (Code128Bean and other barcode beans) and add it to an existing PDF. The barcode image is created locally using the code below.
//Create the barcode bean
Code128Bean code128Bean = new Code128Bean();
final int dpi = 150;
code128Bean.setModuleWidth(UnitConv.in2mm(1.0f / dpi)); //makes the narrow bar
//width exactly one pixel
//bean.setCodeset(2);
code128Bean.doQuietZone(false);
//Open output file
File outputFile = new File("D:/barcode4jcod128.png"); //I dont want to create it
OutputStream code128Stream = new FileOutputStream(outputFile);
try {
//Set up the canvas provider for monochrome PNG output
BitmapCanvasProvider canvas1 = new BitmapCanvasProvider(
code128Stream, "image/x-png", dpi, BufferedImage.TYPE_BYTE_BINARY, false, 0);
//Generate the barcode
code128Bean.generateBarcode(canvas1, "123456");
//Signal end of generation
canvas1.finish();
} finally {
code128Stream.close();
}
My problem is that I don't want to create an image, save it to the local file system, and then add it to the PDF as an image. I just want to create the barcode image dynamically in memory and add it to the PDF.
How do I set the page size (like PDPage.PAGE_SIZE_A4) on the existing PDPages that I retrieved from the catalog.getAllPages() method (List<PDPage> pages = catalog.getAllPages();)?
Can somebody help with this?
Thank you so much for your help, Tilman. Here is what I did:
public static BufferedImage geBufferedImageForCode128Bean(String barcodeString) {
Code128Bean code128Bean = new Code128Bean();
final int dpi = 150;
code128Bean.setModuleWidth(UnitConv.in2mm(1.0f / dpi)); //makes the narrow bar
code128Bean.doQuietZone(false);
BitmapCanvasProvider canvas1 = new BitmapCanvasProvider(
dpi, BufferedImage.TYPE_BYTE_BINARY, false, 0
);
//Generate the barcode
code128Bean.generateBarcode(canvas1, barcodeString);
return canvas1.getBufferedImage();
}
// main code
PDDocument finalDoc = new PDDocument();
BufferedImage bufferedImage = geBufferedImageForCode128Bean("12345");
PDXObjectImage pdImage = new PDPixelMap(doc, bufferedImage);
PDPageContentStream contentStream = new PDPageContentStream(
finalDoc, pdPage, true, true, true
);
contentStream.drawXObject(pdImage, 100, 600, 50, 20);
contentStream.close();
finalDoc.addPage(pdPage);
finalDoc.save(new File("D:/Test75.pdf"));
The barcode is created, but it is drawn vertically. I would like it to be horizontal. Thanks again for your help.
1) add an image to an existing page while keeping the content:
BitmapCanvasProvider canvas1 = new BitmapCanvasProvider(
dpi, BufferedImage.TYPE_BYTE_BINARY, false, 0
);
code128Bean.generateBarcode(canvas1, "123456");
canvas1.finish();
BufferedImage bim = canvas1.getBufferedImage();
PDXObjectImage img = new PDPixelMap(doc, bim);
PDPageContentStream contents = new PDPageContentStream(doc, page, true, true, true);
contents.drawXObject(img, 100, 600, bim.getWidth(), bim.getHeight());
contents.close();
2) set the media box to A4 on an existing page:
page.setMediaBox(PDPage.PAGE_SIZE_A4);
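Both snippets work in PDF user space units, which by default are 1/72 inch. The following plain-Java sketch (no PDFBox involved; the class PdfUnitsSketch and its method names are hypothetical) shows the two conversions that are easy to get wrong: millimetres to points for a page size such as A4 (210 x 297 mm), and bitmap pixels to points for an image rendered at a given dpi. Passing bim.getWidth() and bim.getHeight() directly, as above, draws one pixel per point; scaling by 72/dpi instead is an assumption about what physical size is wanted.

```java
// Illustrative unit conversions for PDF default user space (1 unit = 1/72 inch).
final class PdfUnitsSketch {

    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    // Converts a length in millimetres to PDF points.
    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    // Converts a bitmap dimension in pixels to PDF points for a given resolution.
    static double pixelsToPoints(int pixels, int dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // A4 media box dimensions in points
        System.out.println(mmToPoints(210) + " x " + mmToPoints(297));
        // Width in points of a 300-pixel-wide barcode rendered at 150 dpi
        System.out.println(pixelsToPoints(300, 150));
    }
}
```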
I searched for a way to extract strings in right-to-left languages with iTextSharp, but I could not find one. Is it possible to extract right-to-left strings from a PDF file with iTextSharp?
Thanks
EDIT:
This code gives a very good result:
private void writePdf2()
{
using (var document = new Document(PageSize.A4))
{
var writer = PdfWriter.GetInstance(document, new FileStream(@"C:\Users\USER\Desktop\Test2.pdf", FileMode.Create));
document.Open();
FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
var tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);
var reader = new PdfReader(@"C:\Users\USER\Desktop\Test.pdf");
int intPageNum = reader.NumberOfPages;
string text = null;
for (int i = 1; i <= intPageNum; i++)
{
text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));
text = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(text);
File.WriteAllText("page-" + i + "-text.txt", text.ToString());
}
reader.Close();
ColumnText.ShowTextAligned(
canvas: writer.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
//phrase: new Phrase(new Chunk(text, tahoma)),
x: 300,
y: 300,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
System.Diagnostics.Process.Start(@"C:\Users\USER\Desktop\Test2.pdf");
}
But "phrase: new Phrase(new Chunk(text, tahoma))" does not produce correct output for all strings in the PDF. Therefore I used "PdfStamper" to make a PDF that is suitable for "PdfReader" in "iTextSharp".
Reproducing the issue
As initially the OP couldn't provide a sample file, I first tried to reproduce the issue with a file generated by iTextSharp itself.
My test method first creates a PDF using the ColumnText.ShowTextAligned with the string constant which according to the OP returns a good result. Then it extracts the text content of that file. Finally it creates a second PDF containing a line created using the good ColumnText.ShowTextAligned call with the string constant and then several lines created using ColumnText.ShowTextAligned with the extracted string with or without the post-processing instructions from the OP's code (UTF8-encoding and -decoding; applying UnicodeCharacterPlacement) performed.
I could not immediately find the UnicodeCharacterPlacement class the OP uses. So I googled a bit and found one such class here. I hope this is essentially the class used by the OP.
public void ExtractTextLikeUser2509093()
{
string rtlGood = @"C:\Temp\test-results\extract\rtlGood.pdf";
string rtlGoodExtract = @"C:\Temp\test-results\extract\rtlGood.txt";
string rtlFinal = @"C:\Temp\test-results\extract\rtlFinal.pdf";
Directory.CreateDirectory(@"C:\Temp\test-results\extract\");
FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
Font tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);
// A - Create a PDF with a good RTL representation
using (FileStream fs = new FileStream(rtlGood, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document document = new Document())
{
PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
document.Open();
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
x: 500,
y: 300,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
}
// B - Extract the text for that good representation and add it to a new PDF
String textA, textB, textC, textD;
using (PdfReader pdfReader = new PdfReader(rtlGood))
{
textA = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new LocationTextExtractionStrategy());
textB = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(textA.ToString()));
textC = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(textA);
textD = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(textB);
File.WriteAllText(rtlGoodExtract, textA + "\n\n" + textB + "\n\n" + textC + "\n\n" + textD + "\n\n");
}
using (FileStream fs = new FileStream(rtlFinal, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document document = new Document())
{
PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
document.Open();
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
x: 500,
y: 600,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textA, tahoma)),
x: 500,
y: 550,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textB, tahoma)),
x: 500,
y: 500,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textC, tahoma)),
x: 500,
y: 450,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textD, tahoma)),
x: 500,
y: 400,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
}
}
The final result:
Thus,
I cannot reproduce the issue. Both the final two variants to me look identical in their Arabic contents with the original line. In particular I could not observe the switch from "سلام" to "سالم". Most likely content of the PDF C:\Users\USER\Desktop\Test.pdf (from which the OP extracted the text in his test) is somehow peculiar and so text extracted from it draws with that switch.
Applying that UnicodeCharacterPlacement class to the extracted text is necessary to get it into the right order.
The other post-processing line,
text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));
does not make any difference and should not be used.
For further analysis we would need that PDF C:\Users\USER\Desktop\Test.pdf.
Inspecting salamword.pdf
Eventually the OP could provide a sample PDF, salamword.pdf:
I used "PrimoPDF" to create a PDF file with this content: "Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم".
Next I read this PDF file. Then I received this output: "Test. Hello world. Hello people. م . م . م دم ".
Indeed I could reproduce this behavior. So I analyzed the way the Arabic writing was encoded inside...
Some background information to start with:
Fonts in PDFs can have (and in the case at hand do have) a completely custom encoding. In particular embedded subsets often are generated by choosing codes as the characters come, e.g. the first character from a given font used on a page is encoded as 1, the second different as 2, the third different as 3 etc.
Thus, simply extracting the codes of the drawn text does not help very much at all (see below for an example from the file at hand). But a font inside a PDF can bring along some extra information allowing an extractor to map the codes to Unicode values. This information might be
a ToUnicode map providing an immediate map code -> Unicode code point;
an Encoding providing a base encoding (e.g. WinAnsiEncoding) and differences from it in the form of glyph names; these names may be standard names or names only meaningful in the context of the font at hand;
ActualText entries for a structure element or marked-content sequence.
The PDF specification describes a method using the ToUnicode and the Encoding information with standard names to extract text from a PDF and presents ActualText as an alternative way where applicable. The iTextSharp text extraction code implements the ToUnicode / Encoding method with standard names.
Standard names in this context in the PDF specification are character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font.
In the file at hand:
Let's look at the Arabic text in the line written in Arial. The codes used for the glyphs here are:
01 02 03 04 05 01 02 06 07 01 08 02 06 07 01 09 05 0A 0B 01 08 02 06 07
This looks very much like an ad-hoc encoding as described above. Thus, these codes alone do not help at all.
Thus, let's look at the ToUnicode mapping of the embedded Arial subset:
<01><01><0020>
<02><02><0645>
<03><03><062f>
<04><04><0631>
<08><08><002e>
<0c><0c><0028>
<0d><0d><0077>
<0e><0e><0069>
<0f><0f><0074>
<10><10><0068>
<11><11><0041>
<12><12><0072>
<13><13><0061>
<14><14><006c>
<15><15><0066>
<16><16><006f>
<17><17><006e>
<18><18><0029>
This maps 01 to 0020, 02 to 0645, 03 to 062f, 04 to 0631, 08 to 002e, etc. It does not map 05, 06, 07, etc to anything, though.
So the ToUnicode map only helps for some codes.
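The effect can be sketched in plain Java (this is not iTextSharp's implementation; the class ToUnicodeSketch and its members are hypothetical). The sketch applies the entries of the ToUnicode map that matter for the code sequence quoted above and records every code for which no mapping exists. The Latin entries 0c to 18 are left out because those codes do not occur in that sequence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: applies a partial code -> Unicode map to a glyph code sequence.
final class ToUnicodeSketch {

    // Glyph codes of the Arabic text in the Arial line, as quoted above.
    static final int[] CODES = {
        0x01, 0x02, 0x03, 0x04, 0x05, 0x01, 0x02, 0x06, 0x07, 0x01, 0x08, 0x02,
        0x06, 0x07, 0x01, 0x09, 0x05, 0x0A, 0x0B, 0x01, 0x08, 0x02, 0x06, 0x07
    };

    // The ToUnicode entries relevant for CODES (subset of the map shown above).
    static Map<Integer, Integer> sampleToUnicode() {
        Map<Integer, Integer> map = new HashMap<>();
        map.put(0x01, 0x0020);
        map.put(0x02, 0x0645);
        map.put(0x03, 0x062F);
        map.put(0x04, 0x0631);
        map.put(0x08, 0x002E);
        return map;
    }

    // Decodes what can be decoded; codes without a mapping are collected in 'unmapped'.
    static String decode(int[] codes, Map<Integer, Integer> toUnicode, List<Integer> unmapped) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Integer codePoint = toUnicode.get(code);
            if (codePoint == null) {
                unmapped.add(code);
            } else {
                sb.appendCodePoint(codePoint);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> unmapped = new ArrayList<>();
        String text = decode(CODES, sampleToUnicode(), unmapped);
        System.out.println("decoded characters: " + text.length());
        System.out.println("codes without mapping: " + unmapped.size());
    }
}
```

Roughly half of the glyph codes in the sequence end up in the unmapped list, so an extractor relying on this map alone loses them.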
Now let's look at the associated encoding
29 0 obj
<</Type/Encoding
/BaseEncoding/WinAnsiEncoding
/Differences[ 1
/space/uni0645/uni062F/uni0631
/uni0645.init/uni06440627.fina/uni0633.init/period
/uni0647.fina/uni0644.medi/uni06A9.init/parenleft
/w/i/t/h
/A/r/a/l
/f/o/n/parenright ]
>>
endobj
The encoding is based on WinAnsiEncoding but all codes of interest are remapped in the Differences. There we find numerous standard glyph names (i.e. character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font) like space, period, w, i, t, etc.; but we also find several non-standard names like uni0645, uni06440627.fina etc.
There appears to be a scheme used for these names, uni0645 represents the character at Unicode code point 0645, and uni06440627.fina most likely represents the characters at Unicode code point 0644 and 0627 in some order in some final form. But still these names are non-standard for the purpose of text extraction according to the method presented by the PDF specification.
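If one were willing to go beyond the method in the specification, the apparent naming scheme could be exploited. The following plain-Java sketch is only a guess at that scheme, under the assumption that a name of the form uniXXXX[XXXX...] with an optional .suffix lists one or more four-digit hexadecimal code points; the class UniGlyphNameSketch is hypothetical and this is not something iTextSharp does.

```java
// Illustrative only: decodes glyph names that follow the assumed "uniXXXX[.suffix]" scheme.
final class UniGlyphNameSketch {

    // Returns the decoded characters, or null if the name does not follow the scheme.
    static String decode(String glyphName) {
        // Drop a form suffix such as ".init", ".medi" or ".fina".
        int dot = glyphName.indexOf('.');
        String base = dot >= 0 ? glyphName.substring(0, dot) : glyphName;

        // Require "uni" followed by one or more groups of four hex digits.
        if (!base.matches("uni([0-9A-Fa-f]{4})+")) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 3; i < base.length(); i += 4) {
            sb.appendCodePoint(Integer.parseInt(base.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"uni0645", "uni06440627.fina", "uni0633.init", "space"}) {
            String decoded = decode(name);
            System.out.println(name + " -> " + (decoded == null ? "(not in uni scheme)" : decoded));
        }
    }
}
```

Whether the code points in a multi-character name are listed in logical or visual order is not known from the file alone, so the result would still need checking against the rendered text.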
Furthermore, there are no ActualText entries in the file at all.
So the reason why only " م . م . م دم " is extracted is that only for these glyphs does the PDF contain proper information for the standard PDF text extraction method.
By the way, if you copy&paste from your file in Adobe Reader you'll get a similar result, and Adobe Reader has a fairly good implementation of the standard text extraction method.
TL;DR
The sample file simply does not contain the information required for text extraction with the method described by the PDF specification, which is the method implemented by iTextSharp.
PDFBOX / JSF
I'm trying to change the font height of a given text. I only know how to change the font size.
PDPageContentStream contentStreambc = new PDPageContentStream(doc1, page, true, true);
contentStreambc.setFont( fonta, 16 );
contentStreambc.beginText();
contentStreambc.moveTextPositionByAmount(200, 320);
contentStreambc.drawString( "abcdef");
contentStreambc.endText();
contentStreambc.close();
The code works fine. But how do I change the font height?
Thanks in advance, Stack members.
If you need something like this
you can create it with this code:
PDRectangle rec = new PDRectangle(220, 70);
PDDocument document = null;
document = new PDDocument();
PDPage page = new PDPage(rec);
document.addPage(page);
PDPageContentStream content = new PDPageContentStream(document, page, true, true);
content.beginText();
content.moveTextPositionByAmount(7, 55);
content.setFont(PDType1Font.HELVETICA, 12);
content.drawString("Normal text (size 12)");
content.setTextMatrix(1, 0, 0, 1.5f, 7, 30);
content.drawString("Stretched text (size 12, factor 1.5)");
content.setTextMatrix(1, 0, 0, 2f, 7, 5);
content.drawString("Stretched text (size 12, factor 2)");
content.endText();
content.close();
document.save("SimplePdfStretchedText.pdf");
The code stretches the text by setting the text matrix accordingly; for details cf. chapter 9 of the PDF specification ISO 32000-1.
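To see why only the height changes, it helps to apply the matrix by hand. The six numbers a, b, c, d, e, f passed to setTextMatrix map a point (x, y) to (a*x + c*y + e, b*x + d*y + f). The plain-Java sketch below (the class TextMatrixSketch is hypothetical, no PDFBox involved) applies the matrix (1, 0, 0, 1.5, 7, 30) from the code above to the top of a 12-unit-high glyph.

```java
// Illustrative only: applies a PDF matrix [a b c d e f] to a point.
final class TextMatrixSketch {

    // Returns {x', y'} with x' = a*x + c*y + e and y' = b*x + d*y + f.
    static double[] apply(double a, double b, double c, double d, double e, double f,
                          double x, double y) {
        return new double[] { a * x + c * y + e, b * x + d * y + f };
    }

    public static void main(String[] args) {
        // Matrix of the "factor 1.5" line: unscaled in x, scaled by 1.5 in y,
        // translated to (7, 30).
        double[] baseline = apply(1, 0, 0, 1.5, 7, 30, 0, 0);
        double[] top = apply(1, 0, 0, 1.5, 7, 30, 0, 12);
        System.out.println("baseline y = " + baseline[1]);
        System.out.println("top y      = " + top[1]);
        System.out.println("height     = " + (top[1] - baseline[1]));
    }
}
```

A point 12 units above the baseline ends up 18 units above it, while horizontal distances are unchanged because a stays 1.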
PS: As you mention bar codes in a comment to another answer, this should indeed allow you to make higher bar codes while keeping the distances.
I am trying to integrate iTextSharp into an existing Document Imaging application that allows users to rotate individual pages that may have been scanned in at an incorrect angle (it happens more often than I would have thought).
I have the actual page data rotating correctly across 90/180 degrees, but the page orientation is not rotating along with it. I have just started working with iTextSharp, so I'm still a little unfamiliar with its methods, but was able to cobble together what I have so far using posts from StackOverflow. It's close, but not quite there.
Here's what I have so far:
' Get the input document and total number of pages
Dim inputPdf As New iTextSharp.text.pdf.PdfReader(fileName)
Dim pageCount As Integer = inputPdf.NumberOfPages
' Load the input document
Dim inputDoc As New iTextSharp.text.Document(inputPdf.GetPageSizeWithRotation(1))
' Set up the file stream for our output document
Dim outFileName As String = Path.ChangeExtension(fileName, "pdf")
Using fs As New FileStream(outFileName, FileMode.Create)
' Create the output writer
Dim outputWriter As iTextSharp.text.pdf.PdfWriter = iTextSharp.text.pdf.PdfWriter.GetInstance(inputDoc, fs)
inputDoc.Open()
' Copy pages from input to output document
Dim cb As iTextSharp.text.pdf.PdfContentByte = outputWriter.DirectContent
For index As Integer = 1 To pageCount
inputDoc.SetPageSize(inputPdf.GetPageSizeWithRotation(index))
inputDoc.NewPage()
' If this is our page to be rotated, perform the desired transform
' TODO - 90 degree rotations need to change the page orientation as well
Dim page As iTextSharp.text.pdf.PdfImportedPage = outputWriter.GetImportedPage(inputPdf, index)
If index = pageNum Then
Select Case angle
Case 90
cb.AddTemplate(page, 0, -1, 1, 0, 0, page.Height)
Case 180
cb.AddTemplate(page, -1, 0, 0, -1, page.Width, page.Height)
Case 270
cb.AddTemplate(page, 0, 1, -1, 0, page.Width, 0)
Case Else
' Should not be here, but don't do anything
cb.AddTemplate(page, 1, 0, 0, 1, 0, 0)
End Select
Else
' No rotation; add as is
cb.AddTemplate(page, 1, 0, 0, 1, 0, 0)
End If
Next
inputDoc.Close()
End Using
I tried adding the following code to the top to grab the page size from the existing page and swap the dimensions if the rotation angle was 90 or 270:
For index As Integer = 1 To pageCount
Dim pageSize As iTextSharp.text.Rectangle = inputPdf.GetPageSizeWithRotation(index)
If angle = 90 OrElse angle = 270 Then
' For 90-degree rotations, change the orientation of the page, too
pageSize = New iTextSharp.text.Rectangle(pageSize.Height, pageSize.Width)
End If
inputDoc.SetPageSize(pageSize)
inputDoc.NewPage()
Unfortunately, this had the effect of rotating every page 90 degrees, and the data on the page I wanted rotated didn't show up in the correct spot anyway (it was shifted down and off the page a bit).
Like I said, I'm not really familiar with the inner workings of the API. I have checked out the examples online at the sourceforge page, and have taken a look at the book (both editions), but I don't see anything that fits the bill. I saw an example here that shows page orientation for newly-composed PDFs, but nothing for existing ones. Can anybody help me out? Thanks!
You're making this harder than it needs to be.
Rather than rotating the page contents, you want to rotate the page itself:
PdfReader reader = new PdfReader(path);
PdfStamper stamper = new PdfStamper( reader, outStream );
PdfDictionary pageDict = reader.getPageN(desiredPage);
int desiredRot = 90; // 90 degrees clockwise from what it is now
PdfNumber rotation = pageDict.getAsNumber(PdfName.ROTATE);
if (rotation != null) {
desiredRot += rotation.intValue();
desiredRot %= 360; // must be 0, 90, 180, or 270
}
pageDict.put(PdfName.ROTATE, new PdfNumber(desiredRot)); // closing parenthesis was missing
stamper.close();
That's it. You can play around with desiredPage and desiredRot to get whatever effect you're after. Enjoy.
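The rotation arithmetic is worth isolating, because negative deltas (for counter-clockwise turns) and a missing Rotate entry are easy to mishandle. The plain-Java sketch below (no iText involved; the class PageRotationSketch and its methods are hypothetical) normalizes the new value into 0, 90, 180 or 270 and shows how the displayed page dimensions swap for quarter turns, which is the orientation change the question asks for.

```java
// Illustrative only: arithmetic for a page's Rotate value.
final class PageRotationSketch {

    // Adds 'delta' degrees to the current rotation; a null current value means "no Rotate entry".
    static int rotateBy(Integer currentRotate, int delta) {
        int current = currentRotate == null ? 0 : currentRotate;
        int result = (current + delta) % 360;
        if (result < 0) {
            result += 360; // Java's % keeps the sign of the dividend
        }
        if (result % 90 != 0) {
            throw new IllegalArgumentException("Rotate must be a multiple of 90: " + result);
        }
        return result;
    }

    // Returns {width, height} as displayed by a viewer for the given rotation.
    static double[] displayedSize(double width, double height, int rotate) {
        boolean quarterTurn = rotate % 180 != 0;
        return quarterTurn ? new double[] { height, width } : new double[] { width, height };
    }

    public static void main(String[] args) {
        System.out.println(rotateBy(null, 90)); // page without a Rotate entry
        System.out.println(rotateBy(270, 90));  // wraps around
        System.out.println(rotateBy(0, -90));   // counter-clockwise turn
        double[] size = displayedSize(595, 842, 90);
        System.out.println(size[0] + " x " + size[1]);
    }
}
```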