iTextSharp Read Text From Single Layer of PDF

Currently I am using a custom LocationTextExtractionStrategy to extract text from a PDF that returns a TextRenderInfo[]. I would like to be able to determine if a TextRenderInfo object (or PDFString, child of TextRenderInfo) appears in a specific layer. I am not sure if this is possible. To get the layers in a PDF, I am using:
Dictionary<string,PdfLayer> layers;
using (var pdfReader = new PdfReader(src))
{
var newSrc = Path.Combine(["new file location"]);
using (var stream = new FileStream(newSrc, FileMode.Create))
{
PdfStamper stamper = new PdfStamper(pdfReader, stream);
layers = stamper.GetPdfLayers();
stamper.Close();
}
pdfReader.Close();
src = newSrc;
}
To extract the text, I am using:
var textExtractor = new TextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(pdfReader, pdfPageNum,textExtractor);
List<TextRenderInfo> results = textExtractor.Results;
Is there any way to check whether the individual TextRenderInfo results exist within the layers obtained in the first code snippet? Any help would be much appreciated.

It is possible to get the contents from a single layer, but you'll have to jump through a few hoops to work it out. Specifically, you will have to recreate some of the logic that is provided by the PdfTextExtractor and PdfReaderContentParser.
public static String GetText(PdfReader reader, int pageNumber, int streamNumber) {
var strategy = new LocationTextExtractionStrategy();
var processor = new PdfContentStreamProcessor(strategy);
// look up the page dictionary to get at the page's resources
PdfDictionary pageDic = reader.GetPageN(pageNumber);
var resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
// assuming you still need to extract the page bytes
byte[] contents = GetContentBytesForPageStream(reader, pageNumber, streamNumber);
processor.ProcessContent(contents, resourcesDic);
return strategy.GetResultantText();
}
public static byte[] GetContentBytesForPageStream(PdfReader reader, int pageNumber, int streamNumber) {
PdfDictionary pageDictionary = reader.GetPageN(pageNumber);
PdfObject contentObject = pageDictionary.Get(PdfName.CONTENTS);
if (contentObject == null)
return new byte[0];
byte[] contentBytes = GetContentBytesFromContentObject(contentObject, streamNumber);
return contentBytes;
}
public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber) {
// copy-paste logic from
// ContentByteUtils.GetContentBytesFromContentObject(contentObject);
// but in case PdfObject.ARRAY: only select the streamNumber you require
throw new NotImplementedException(); // placeholder until the copied logic is filled in
}
If you're specifically looking to just use PdfTextExtractor or PdfReaderContentParser, and ask the returned TextRenderInfo for the layer it's on, then I'm not sure it will be easily possible. There are a number of problems with that:
TextRenderInfo doesn't store that information, so you'd have to subclass it (which is possible)
you'd have to rewrite the logic that creates the TextRenderInfo objects. This is possible by registering custom IContentOperator objects for all text operators (Tj, TJ, ' and ") with the PdfTextExtractor or PdfReaderContentParser
the hardest part is that you have already lost layer information in ContentByteUtils.GetContentBytesFromContentObject - so you'd need to retain that somehow, which creates its own set of problems.
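The last point can be illustrated with a small sketch in plain Java (no iText dependency). It assumes a page's content is available as a list of already decoded byte arrays, one per content stream; the class name, the method names, and the zero-based stream index are hypothetical choices for illustration, not iText API. Once the streams are joined into a single byte array, nothing records which stream a given operator came from, which is why selecting the stream has to happen before that step.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Hypothetical helper, not part of iText: contrasts joining all content
// streams of a page with picking exactly one of them.
class ContentStreamSelector {

    // Joins all streams into one byte array, as is done when a whole page
    // is processed. After this step the stream boundaries are gone.
    static byte[] concatenate(List<byte[]> streams) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] s : streams) {
            all.write(s, 0, s.length);
            all.write(' '); // separator so operators of adjacent streams do not merge
        }
        return all.toByteArray();
    }

    // Returns only the stream at the given index (zero-based here, which is
    // an assumption), or an empty array if the index is out of range.
    static byte[] select(List<byte[]> streams, int streamNumber) {
        if (streamNumber < 0 || streamNumber >= streams.size()) {
            return new byte[0];
        }
        return streams.get(streamNumber);
    }
}
```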

Related

Error using OpenXML to read a .docx file from a memorystream to a WordprocessingDocument to a string and back

I have an existing library that I can use to receive a docx file and return it. The software is .Net Core hosted in a Linux Docker container.
It's very limited in scope though and I need to perform some actions it can't do. As these are straightforward I thought I would use OpenXML, and for my proof of concept all I need to do is to read a docx as a memorystream, replace some text, turn it back into a memorystream and return it.
However the docx that gets returned is unreadable. I've commented out the text replacement below to eliminate that, and if I comment out the call to the method below then the docx can be read so I'm sure the issue is in this method.
Presumably I'm doing something fundamentally wrong here but after a few hours googling and playing around with the code I am not sure how to correct this; any ideas what I have wrong?
Thanks for the help
private MemoryStream SearchAndReplace(MemoryStream mem)
{
mem.Position = 0;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(mem, true))
{
string docText = null;
StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream());
docText = sr.ReadToEnd();
//Regex regexText = new Regex("Hello world!");
//docText = regexText.Replace(docText, "Hi Everyone!");
MemoryStream newMem = new MemoryStream();
newMem.Position = 0;
StreamWriter sw = new StreamWriter(newMem);
sw.Write(docText);
return newMem;
}
}
If your real requirement is to search and replace text in a WordprocessingDocument, you should have a look at this answer.
The following unit test shows how you can make your approach work if the use case really demands that you read a string from a part, "massage" the string, and write the changed string back to the part. It also shows one of the shortcomings of any other approach than the one described in the answer already mentioned above, e.g., by demonstrating that the string "Hello world!" will not be found in this way if it is split across w:r elements.
[Fact]
public void CanSearchAndReplaceStringInOpenXmlPartAlthoughThisIsNotTheWayToSearchAndReplaceText()
{
// Arrange.
using var docxStream = new MemoryStream();
using (var wordDocument = WordprocessingDocument.Create(docxStream, WordprocessingDocumentType.Document))
{
MainDocumentPart part = wordDocument.AddMainDocumentPart();
var p1 = new Paragraph(
new Run(
new Text("Hello world!")));
var p2 = new Paragraph(
new Run(
new Text("Hello ") { Space = SpaceProcessingModeValues.Preserve }),
new Run(
new Text("world!")));
part.Document = new Document(new Body(p1, p2));
Assert.Equal("Hello world!", p1.InnerText);
Assert.Equal("Hello world!", p2.InnerText);
}
// Act.
SearchAndReplace(docxStream);
// Assert.
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(docxStream, false))
{
MainDocumentPart part = wordDocument.MainDocumentPart;
Paragraph p1 = part.Document.Descendants<Paragraph>().First();
Paragraph p2 = part.Document.Descendants<Paragraph>().Last();
Assert.Equal("Hi Everyone!", p1.InnerText);
Assert.Equal("Hello world!", p2.InnerText);
}
}
private static void SearchAndReplace(MemoryStream docxStream)
{
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(docxStream, true))
{
// If you wanted to read the part's contents as text, this is how you
// would do it.
string partText = ReadPartText(wordDocument.MainDocumentPart);
// Note that this is not the way in which you should search and replace
// text in Open XML documents. The text might be split across multiple
// w:r elements, so you would not find the text in that case.
var regex = new Regex("Hello world!");
partText = regex.Replace(partText, "Hi Everyone!");
// If you wanted to write changed text back to the part, this is how
// you would do it.
WritePartText(wordDocument.MainDocumentPart, partText);
}
docxStream.Seek(0, SeekOrigin.Begin);
}
private static string ReadPartText(OpenXmlPart part)
{
using Stream partStream = part.GetStream(FileMode.OpenOrCreate, FileAccess.Read);
using var sr = new StreamReader(partStream);
return sr.ReadToEnd();
}
private static void WritePartText(OpenXmlPart part, string text)
{
using Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write);
using var sw = new StreamWriter(partStream);
sw.Write(text);
}
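The pitfall the unit test demonstrates can be reproduced without the Open XML SDK. The following plain Java sketch runs a literal search and replace over two hand-written, heavily simplified WordprocessingML fragments (they are assumptions for illustration, not output of a real document). The fragment with a single run is changed, while the fragment whose text is split across two w:r elements is returned unchanged, because the markup between the runs interrupts the search text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: simplified, hand-written XML fragments.
class SplitRunDemo {
    static final String SINGLE_RUN =
        "<w:p><w:r><w:t>Hello world!</w:t></w:r></w:p>";
    static final String SPLIT_RUNS =
        "<w:p><w:r><w:t xml:space=\"preserve\">Hello </w:t></w:r>"
        + "<w:r><w:t>world!</w:t></w:r></w:p>";

    // Replaces every literal occurrence of search in the raw XML string.
    static String replaceInRawXml(String xml, String search, String replacement) {
        return Pattern.compile(Pattern.quote(search))
                      .matcher(xml)
                      .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // The single-run paragraph is modified ...
        System.out.println(replaceInRawXml(SINGLE_RUN, "Hello world!", "Hi Everyone!"));
        // ... the split-run paragraph is not, although it displays the same text.
        System.out.println(replaceInRawXml(SPLIT_RUNS, "Hello world!", "Hi Everyone!"));
    }
}
```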

itextsharp split shared xObject streams

I am trying to split shared XObject streams (originally flattened form fields with the same content) in the PDF.
What is the correct way to do this using itextsharp? I am trying the code below but the stream is still shared in the resulting document.
Sample pdf with shared XObject streams flatten.pdf
PdfReader pdf = new PdfReader(path);
PdfStamper stamper = new PdfStamper(pdf, new FileStream("processed.pdf", FileMode.OpenOrCreate, FileAccess.ReadWrite));
EliminateSharedXObjectStreams(stamper, 1);
stamper.Close();
virtual public void EliminateSharedXObjectStreams(PdfStamper pdfStamper, int pageNum)
{
PdfReader pdfReader = pdfStamper.Reader;
PdfDictionary page = pdfReader.GetPageN(pageNum);
PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
PdfDictionary xObjResources = resources.GetAsDict(PdfName.XOBJECT);
List<PRIndirectReference> newRefs = new List<PRIndirectReference>();
List<PdfName> newNames = new List<PdfName>();
List<PRStream> newStreams = new List<PRStream>();
IntHashtable visited = new IntHashtable();
foreach (PdfName key in xObjResources.Keys)
{
PdfStream xObj = xObjResources.GetAsStream(key);
if (xObj is PRStream && xObj.GetAsName(PdfName.SUBTYPE) != null &&
xObj.GetAsName(PdfName.SUBTYPE).CompareTo(PdfName.FORM) == 0)
{
PRIndirectReference refi = (PRIndirectReference)xObjResources.Get(key);
PRStream xFormStream = (PRStream)xObj;
if (visited.ContainsKey(refi.Number))
{
// need to duplicate
newRefs.Add(refi);
PRStream newStream = new PRStream(xFormStream, null);
newStreams.Add(newStream);
newNames.Add(key);
}
else
visited[xFormStream.ObjNum] = 1;
}
}
if (newStreams.Count == 0)
return;
PdfContentByte canvas = pdfStamper.GetOverContent(pageNum);
PdfWriter writer = pdfStamper.Writer;
for (int k = 0; k < newStreams.Count; ++k)
{
canvas.SaveState();
//add copied stream
PdfIndirectReference newRef = writer.AddToBody(newStreams[k]).IndirectReference;
//change the ref
xObjResources.Put(newNames[k], newRef);
canvas.RestoreState();
}
}
First remarks without a sample document
There are numerous reasons why your code may not work as desired. As you did not supply your sample PDF, I cannot tell which are more relevant and which are not.
You only search xobjects shared on the same page; if a xobject is once used on page one and once on page two, your code cannot identify this.
If you want to be able to find such shares, you'll at least have to use the same IntHashtable instance visited across all calls of EliminateSharedXObjectStreams for the same PdfStamper pdfStamper, e.g. by creating it once outside this method and making it a parameter of your method.
You only check for shared xobjects in the immediate page resources. But form xobjects have their own resources which can contain even more form xobject declarations.
If you want to find such shares, you'll have to recurse into the resources of your page's xobjects, those xobjects' xobjects, and so on.
(Strictly speaking you also have to recurse into the form xobjects of patterns and Type 3 Font glyph definitions, but these are unlikely positions to flatten form fields into.)
You only check for shared xobjects with different names. But xobjects can also be shared by referencing the same name multiple times from the same content stream.
If you want to find such shares, you have to analyse the content streams in question to find duplicate usages of the form xobject with the same name.
(By the way, doing so you may also check whether declared xobjects are used at all: if a form xobject is declared in some resources, this does not mean it is used in the context of these resources, it may be an unused resource.)
You don't mark xObjResources (if it itself is indirect) or page (otherwise) as used. If your PdfStamper pdfStamper is working in append mode, your changes may be ignored.
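The first remark above, keeping one visited set for the whole document rather than one per call, can be sketched in plain Java without iText. The sketch assumes each page's XObject resources have already been reduced to a map from resource name to the object number of the referenced stream; the class, the method, and that input shape are hypothetical. It does not recurse into nested resources or inspect content streams, so it covers only that one remark.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: reports references to streams that were already
// referenced earlier, on the same page or on a previous one.
class SharedXObjectFinder {

    // pageXObjects.get(0) describes page 1: resource name -> object number.
    static List<String> findShared(List<Map<String, Integer>> pageXObjects) {
        Set<Integer> visited = new HashSet<>(); // one instance for all pages
        List<String> shared = new ArrayList<>();
        for (int page = 1; page <= pageXObjects.size(); page++) {
            for (Map.Entry<String, Integer> e : pageXObjects.get(page - 1).entrySet()) {
                // Set.add returns false when the object number was seen before
                if (!visited.add(e.getValue())) {
                    shared.add("page " + page + ": /" + e.getKey());
                }
            }
        }
        return shared;
    }
}
```

With a per-page set, a stream used once on page one and once on page two would go unnoticed; with the shared set, the second reference is reported as one that needs a duplicate.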
Solution with a sample document
After you provided the information that
It's single page document containing shared streams (xobjects with different names) in the immediate page resources. pdfStamper is not in append mode.
it turned out that the problems mentioned above are not relevant in your case. As you meanwhile also have provided an example document, I could reproduce the issue.
Indeed, your code does not split the shared XObjects. The reason is that the PdfStamper is made for manipulating the PDF in the PdfReader in the state it was in when the stamper was constructed, using stamper methods only. Your code, on the other hand, manipulates objects directly retrieved from the PdfReader after the construction of the stamper. Thus, while your new streams are added to the PDF (actually up front), the changes in the pre-existing XObject resource dictionaries don't make it to the result.
If you want to manipulate objects you retrieve from the reader, you instead should do this before creating the stamper.
This actually should suit you, as your code is structurally copied from a PdfReader method anyway, EliminateSharedStreams, which you adapted to your use case.
The only problem is that that method uses a hidden member variable of the PdfReader class. But you can access that variable by means of reflection.
Thus, the manipulated method (working on a pure PdfReader) could look like this:
virtual public void EliminateSharedXObjectStreams(PdfReader pdfReader, int pageNum)
{
PdfDictionary page = pdfReader.GetPageN(pageNum);
PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
PdfDictionary xObjResources = resources.GetAsDict(PdfName.XOBJECT);
List<PRIndirectReference> newRefs = new List<PRIndirectReference>();
List<PRStream> newStreams = new List<PRStream>();
IntHashtable visited = new IntHashtable();
foreach (PdfName key in xObjResources.Keys)
{
PdfStream xObj = xObjResources.GetAsStream(key);
if (xObj is PRStream && xObj.GetAsName(PdfName.SUBTYPE) != null &&
xObj.GetAsName(PdfName.SUBTYPE).CompareTo(PdfName.FORM) == 0)
{
PRIndirectReference refi = (PRIndirectReference)xObjResources.Get(key);
PRStream xFormStream = (PRStream)xObj;
if (visited.ContainsKey(refi.Number))
{
// need to duplicate
newRefs.Add(refi);
PRStream newStream = new PRStream(xFormStream, null);
newStreams.Add(newStream);
}
else
visited[xFormStream.ObjNum] = 1;
}
}
if (newStreams.Count == 0)
return;
FieldInfo xrefObjField = typeof(PdfReader).GetField("xrefObj", BindingFlags.Instance | BindingFlags.NonPublic);
List<PdfObject> xrefObj = (List<PdfObject>)xrefObjField.GetValue(pdfReader);
for (int k = 0; k < newStreams.Count; ++k)
{
xrefObj.Add(newStreams[k]);
PRIndirectReference refi = newRefs[k];
refi.SetNumber(xrefObj.Count - 1, 0);
}
}
and you can use it like this:
using (PdfReader pdfReader = new PdfReader(sourcePath))
using (Stream pdfStream = new FileStream(targetPath, FileMode.Create, FileAccess.Write))
{
EliminateSharedXObjectStreams(pdfReader, 1);
PdfStamper pdfStamper = new PdfStamper(pdfReader, pdfStream);
pdfStamper.Close();
}
in particular calling EliminateSharedXObjectStreams before constructing the PdfStamper.
If you are after a generic solution, you of course will have to extend the method to remove the restrictions observed in the first part of the answer...
Solution without reflection
The OP found out:
Manipulating PdfReader works as expected. The only thing is that, instead of using the xrefObj private field, the stream can be added using AddPdfObject:
for (int k = 0; k < newStreams.Count; ++k)
{
PRIndirectReference newRef = pdfReader.AddPdfObject(newStreams[k]);
PRIndirectReference refi = newRefs[k];
refi.SetNumber(newRef.Number, 0);
}
Indeed, this improves the solution substantially.

iText LocationTextExtractionStrategy/HorizontalTextExtractionStrategy splits text into single characters

I used an extended version of LocationTextExtractionStrategy to extract connected texts of a PDF and their positions/sizes. I did this by using the locationalResult. This worked well until I tested a PDF containing texts with a different font (TTF). Suddenly these texts are split into single characters or small fragments.
For example, "Detail" is no longer one object within the locationalResult list but is split into six items (D, e, t, a, i, l).
I tried using the HorizontalTextExtractionStrategy by making the getLocationalResult method public:
public List<TextChunk> GetLocationalResult()
{
return (List<TextChunk>)locationalResultField.GetValue(this);
}
and using the PdfReaderContentParser to extract the texts:
reader = new PdfReader("some_pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
var strategy = parser.ProcessContent(i, new HorizontalTextExtractionStrategy());
foreach (HorizontalTextExtractionStrategy.HorizontalTextChunk chunk in strategy.GetLocationalResult())
{
// Do something with the chunk
}
but this also returns the same result.
Is there any other way to extract connected texts from a pdf?
I used an extended version of LocationTextExtractionStrategy to extract connected texts of a PDF and their positions/sizes. I did this by using the locationalResult. This worked well until I tested a PDF containing texts with a different font (TTF). Suddenly these texts are split into single characters or small fragments.
This problem is due to wrong expectations concerning the contents of the LocationTextExtractionStrategy.locationalResult private list member variable.
This list of TextChunk instances contains the pieces of text as they were forwarded to the strategy from the parsing framework (or probably as they were preprocessed by some filter class), and the framework forwards each single string it encounters in a content stream separately.
Thus, if a seemingly connected word in the content stream actually is drawn using multiple strings, you get multiple TextChunk instances for it.
There actually is some "intelligence" in the method getResultantText joining these chunks properly, adding a space where necessary and so on.
In case of your document, "DETAIL " usually is drawn like this:
[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ
As you see, there are slight text insertion point moves between 'D' and 'E', 'T' and 'A', 'I' and 'L', and 'L' and ' '. (Such mini moves usually represent kerning.) Thus, you'll get individual TextChunk instances for 'D', 'ET', 'AI', 'L', and ' '.
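To make the fragmentation concrete, here is a plain Java sketch (no iText involved) that takes the TJ array shown above as a string and lists its string operands in order, skipping the numeric adjustments. Each string operand is what gets forwarded separately. The code-to-letter table is not read from any font; it is an assumption inferred from the statement that this array draws "DETAIL ", and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: splits a TJ array into its hex string operands.
class TjArrayDemo {

    // Returns the hex string operands in order; numbers between them are skipped.
    static List<String> hexStrings(String tjArray) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("<([0-9A-Fa-f]+)>").matcher(tjArray);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    // Decodes 2-byte codes with a hard-coded table (assumed, see lead-in).
    static String decode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append(glyph(Integer.parseInt(hex.substring(i, i + 4), 16)));
        }
        return sb.toString();
    }

    static char glyph(int code) {
        switch (code) {
            case 0x0027: return 'D';
            case 0x0028: return 'E';
            case 0x0037: return 'T';
            case 0x0024: return 'A';
            case 0x002C: return 'I';
            case 0x002F: return 'L';
            case 0x0003: return ' ';
            default:     return '?'; // code not covered by the assumed table
        }
    }
}
```

For the array above, hexStrings yields the operands 0027, 00280037, 0024002c, 002f and 0003, in that order.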
Admittedly, the LocationTextExtractionStrategy.locationalResult member is not very well documented; but as it is a private member, this IMHO is forgivable.
That this worked well for many documents is due to many PDF creators not applying kerning and simply drawing connected text using single string objects.
The HorizontalTextExtractionStrategy is derived from the LocationTextExtractionStrategy and mainly differs from it in the way it arranges the TextChunk instances to a single string. Thus, you'll see the same fragmentation here.
Is there any other way to extract connected texts from a pdf?
If you want "connected texts" as in "atomic string objects in the content stream", you already have them.
If you want "connected texts" as in "visually connected texts, no matter where the constituent letters are drawn in the content stream", you have to glue those TextChunk instances together like the LocationTextExtractionStrategy and HorizontalTextExtractionStrategy do in getResultantText in combination with the comparison methods in their respective TextChunkLocationDefaultImp and HorizontalTextChunkLocation implementations.
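A much simplified stand-in for that gluing step is sketched below in plain Java. It is not iText's algorithm: the Chunk type, the exact comparison of baselines, and the single gap threshold are assumptions chosen to keep the example short. Chunks are sorted top to bottom and left to right, joined directly when they touch, and separated by a space when the horizontal gap exceeds the threshold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: joins positioned text pieces into lines of text.
class ChunkGlueDemo {

    // Hypothetical chunk: text plus horizontal extent and baseline height.
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;
        final float y;

        Chunk(String text, float startX, float endX, float y) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.y = y;
        }
    }

    static String glue(List<Chunk> chunks, float spaceThreshold) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Higher baselines first (top of page), then left to right.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                              .thenComparingDouble(c -> c.startX));
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                if (prev.y != c.y) {
                    sb.append('\n');            // different baseline: new line
                } else if (c.startX - prev.endX > spaceThreshold) {
                    sb.append(' ');             // wide gap: word boundary
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }
}
```

Small positive or negative gaps, such as those produced by kerning, stay below the threshold, so the fragments of one word end up in one string.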
After debugging deep into the iTextSharp library I figured out that my texts are drawn with the TJ operator as mkl also mentioned.
[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ
iText processes these texts not as a single PdfString but as an array of PdfObjects which ends up in calling renderListener.RenderText(renderInfo) for each PdfString item in it (see ShowTextArray class and DisplayPdfString method). In the RenderText method however the information about the relation of the pdf strings within the array got lost and every item is added to locationalResult as an independent object.
As my goal is to extract the "argument of a single text drawing instruction", I extended the PdfContentStreamProcessor class with a new method, ProcessTexts, which returns a list of these atomic strings. My workaround is not very pretty, as I had to copy and paste some private fields and methods from the original source, but it works for me.
class PdfContentStreamProcessorEx : PdfContentStreamProcessor
{
private IDictionary<int, CMapAwareDocumentFont> cachedFonts = new Dictionary<int, CMapAwareDocumentFont>();
private ResourceDictionary resources = new ResourceDictionary();
private CMapAwareDocumentFont font = null;
public PdfContentStreamProcessorEx(IRenderListener renderListener) : base(renderListener)
{
}
public List<string> ProcessTexts(byte[] contentBytes, PdfDictionary resources)
{
this.resources.Push(resources);
var texts = new List<string>();
PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(contentBytes)));
PdfContentParser ps = new PdfContentParser(tokeniser);
List<PdfObject> operands = new List<PdfObject>();
while (ps.Parse(operands).Count > 0)
{
PdfLiteral oper = (PdfLiteral)operands[operands.Count - 1];
if ("Tj".Equals(oper.ToString()))
{
texts.Add(getText((PdfString)operands[0]));
}
else if ("TJ".Equals(oper.ToString()))
{
string text = string.Empty;
foreach (PdfObject entryObj in (PdfArray)operands[0])
{
if (entryObj is PdfString)
{
text += getText((PdfString)entryObj);
}
}
texts.Add(text);
}
else if ("Tf".Equals(oper.ToString()))
{
PdfName fontResourceName = (PdfName)operands[0];
float size = ((PdfNumber)operands[1]).FloatValue;
PdfDictionary fontsDictionary = resources.GetAsDict(PdfName.FONT);
CMapAwareDocumentFont _font;
PdfObject fontObject = fontsDictionary.Get(fontResourceName);
if (fontObject is PdfDictionary)
_font = GetFont((PdfDictionary)fontObject);
else
_font = GetFont((PRIndirectReference)fontObject);
font = _font;
}
}
this.resources.Pop();
return texts;
}
string getText(PdfString @in)
{
byte[] bytes = @in.GetBytes();
return font.Decode(bytes, 0, bytes.Length);
}
private CMapAwareDocumentFont GetFont(PRIndirectReference ind)
{
CMapAwareDocumentFont font;
cachedFonts.TryGetValue(ind.Number, out font);
if (font == null)
{
font = new CMapAwareDocumentFont(ind);
cachedFonts[ind.Number] = font;
}
return font;
}
private CMapAwareDocumentFont GetFont(PdfDictionary fontResource)
{
return new CMapAwareDocumentFont(fontResource);
}
private class ResourceDictionary : PdfDictionary
{
private IList<PdfDictionary> resourcesStack = new List<PdfDictionary>();
virtual public void Push(PdfDictionary resources)
{
resourcesStack.Add(resources);
}
virtual public void Pop()
{
resourcesStack.RemoveAt(resourcesStack.Count - 1);
}
public override PdfObject GetDirectObject(PdfName key)
{
for (int i = resourcesStack.Count - 1; i >= 0; i--)
{
PdfDictionary subResource = resourcesStack[i];
if (subResource != null)
{
PdfObject obj = subResource.GetDirectObject(key);
if (obj != null) return obj;
}
}
return base.GetDirectObject(key); // shouldn't be necessary, but just in case we've done something crazy
}
}
}

Issues with iTextsharp and pdf manipulation

I am getting a PDF document (no password) that is generated by third-party software, with JavaScript and a few editable fields in it. If I load this PDF document with the PdfReader class, the NumberOfPages property is always 1 although the PDF document has 17 pages. Oddly enough, the document has 17 pages if I save the stream afterwards. When I then try to open the document, Acrobat Reader shows an extended feature warning and the fields are not fillable anymore (I haven't flattened the document). Does anyone know about such a problem?
Background Info:
My job is to remove the javascript code, fill out some fields and save the document afterwards.
I am using the iTextsharp version 5.5.3.0.
Unfortunately I can't upload a sample file because there is some confidential data in it.
private byte[] GetDocumentData(string documentName)
{
var document = String.Format("{0}{1}\\{2}.pdf", _component.OutputDirectory, _component.OutputFileName.Replace(".xml", ".pdf"), documentName);
if (File.Exists(document))
{
PdfReader.unethicalreading = true;
using (var originalData = new MemoryStream(File.ReadAllBytes(document)))
{
using (var updatedData = new MemoryStream())
{
var pdfTool = new PdfInserter(originalData, updatedData) {FormFlattening = false};
pdfTool.RemoveJavascript();
pdfTool.Save();
return updatedData.ToArray();
}
}
}
return null;
}
//Old version that wasn't working
public PdfInserter(Stream pdfInputStream, Stream pdfOutputStream)
{
_pdfInputStream = pdfInputStream;
_pdfOutputStream = pdfOutputStream;
_pdfReader = new PdfReader(_pdfInputStream);
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream);
}
//Solution
public PdfInserter(Stream pdfInputStream, Stream pdfOutputStream, char pdfVersion = '\0', bool append = true)
{
_pdfInputStream = pdfInputStream;
_pdfOutputStream = pdfOutputStream;
_pdfReader = new PdfReader(_pdfInputStream);
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream, pdfVersion, append);
}
public void RemoveJavascript()
{
for (int i = 0; i <= _pdfReader.XrefSize; i++)
{
PdfDictionary dictionary = _pdfReader.GetPdfObject(i) as PdfDictionary;
if (dictionary != null)
{
dictionary.Remove(PdfName.AA);
dictionary.Remove(PdfName.JS);
dictionary.Remove(PdfName.JAVASCRIPT);
}
}
}
The extended feature warning is a hint that the original PDF had been signed using a usage rights signature to "Reader-enable" it, i.e. to tell the Adobe Reader to activate some additional features when opening it, and the OP's operation on it has invalidated the signature.
Indeed, he operated using
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream);
which creates a PdfStamper which completely re-generates the document. To not invalidate the signature, though, one has to use append mode as in the OP's fixed code (for char pdfVersion = '\0', bool append = true):
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream, pdfVersion, append);
If I load this pdf-document with the pdfReader class the NumberOfPagesProperty is always 1 although the pdf-document has 17 pages. Oddly enough the document has 17 pages
Quite likely it is a PDF with a XFA form, i.e. the PDF is only a carrier of some XFA data from which Adobe Reader builds those 17 pages. The actual PDF in that case usually only contains one page saying something like "if you see this, your viewer does not support XFA."
For a final verdict, though, one has to inspect the PDF.

Unable to get page number when using RenderListener interface to find a piece of text in PDF

To create form fields at various places in an existing PDF, iText requires coordinates and a page number.
My PDF is dynamic, so I decided to create the PDF with some identifier text, use TextRenderInfo to find the coordinates of that text, and then use those coordinates to create the text fields and other form fields.
ParsingHelloWorld.java
public void extractText(String src, String dest) throws IOException, DocumentException {
PrintWriter out = new PrintWriter(new FileOutputStream(dest));
PdfReader reader = new PdfReader(src);
PdfStamper stp = new PdfStamper(reader, new FileOutputStream(dest));
RenderListener listener = new MyTextRenderListener(out,reader,stp);
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
for ( int pageNum= 0; pageNum < reader.getNumberOfPages(); pageNum++ ){
PdfDictionary pageDic = reader.getPageN(pageNum);
PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNum), resourcesDic);
}
out.flush();
out.close();
stp.close();
}
MyTextRenderListener.java
public void renderText(TextRenderInfo renderInfo) {
if (renderInfo.getText().startsWith("Fill_in_TextField")){
// creates the text field using the coordinates from the renderInfo object.
createTextField(renderInfo);
}else if (renderInfo.getText().startsWith("Fill_in_SignatureField")){
// creates the signature field using the coordinates from the renderInfo object.
createSignatureField(renderInfo);
}
}
The problem is I have a page number in extractText method in the ParsingHelloWorld class.
When the renderText method is called inside the MyTextRenderListener class internally processing the page content, I couldn't get the pageNumber to generate the fields in the PDF at the particular coordinates where the identifier text resides(ex Fill_in_TextField,Fill_in_SignatureField..etc ).
Any suggestions/ ideas to get the page number in my scenario.
Thanks in advance.
That's easy. Add a member variable and a setter to MyTextRenderListener:
protected int page;
public void setPage(int page) {
this.page = page;
}
Now when you loop over the pages in ParsingHelloWorld, pass the page number to the MyTextRenderListener instance:
listener.setPage(pageNum);
Now you have access to that number in the renderText() method and you can pass it to your createTextField() method.
Note that I think your loop is wrong. Page numbers don't start at page 0, they start at page 1.
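Both points, the setter and the loop bounds, can be tried out with a plain Java sketch that needs no iText. MarkerListener is a hypothetical stand-in for the render listener, and each page is represented simply as a list of text pieces, which is an assumption made for illustration. The loop runs from 1 to the number of pages inclusive and tells the listener the current page before that page's content is processed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a listener that remembers the page it is working on.
class PageAwareListenerDemo {

    // Hypothetical stand-in for a render listener with a page setter.
    static class MarkerListener {
        private int page;
        final List<String> hits = new ArrayList<>();

        void setPage(int page) {
            this.page = page;
        }

        void renderText(String text) {
            if (text.startsWith("Fill_in_TextField")) {
                hits.add("text field on page " + page);
            } else if (text.startsWith("Fill_in_SignatureField")) {
                hits.add("signature field on page " + page);
            }
        }
    }

    // pages.get(0) holds the text pieces of page 1.
    static List<String> process(List<List<String>> pages) {
        MarkerListener listener = new MarkerListener();
        for (int pageNum = 1; pageNum <= pages.size(); pageNum++) {
            listener.setPage(pageNum); // set before processing the page
            for (String text : pages.get(pageNum - 1)) {
                listener.renderText(text);
            }
        }
        return listener.hits;
    }
}
```

Note that the variable is declared with the concrete listener type here; a variable declared with an interface type that has no setPage method could not call the setter without a cast.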