iText LocationTextExtractionStrategy/HorizontalTextExtractionStrategy splits text into single characters

I used an extended version of LocationTextExtractionStrategy to extract connected texts of a PDF and their positions/sizes. I did this by using the locationalResult. This worked well until I tested a PDF containing texts with a different font (TTF). Suddenly these texts are split into single characters or small fragments.
For example, "Detail" is no longer one object within the locationalResult list but is split into six items (D, e, t, a, i, l).
I tried using the HorizontalTextExtractionStrategy by making the GetLocationalResult method public:
public List<TextChunk> GetLocationalResult()
{
    return (List<TextChunk>)locationalResultField.GetValue(this);
}
and using the PdfReaderContentParser to extract the texts:
reader = new PdfReader("some_pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
var strategy = parser.ProcessContent(i, new HorizontalTextExtractionStrategy());
foreach (HorizontalTextExtractionStrategy.HorizontalTextChunk chunk in strategy.GetLocationalResult())
{
    // Do something with the chunk
}
but this also returns the same result.
Is there any other way to extract connected texts from a pdf?

I used an extended version of LocationTextExtractionStrategy to extract connected texts of a PDF and their positions/sizes. I did this by using the locationalResult. This worked well until I tested a PDF containing texts with a different font (TTF). Suddenly these texts are split into single characters or small fragments.
This problem is due to wrong expectations concerning the contents of the LocationTextExtractionStrategy.locationalResult private list member variable.
This list of TextChunk instances contains the pieces of text as they were forwarded to the strategy from the parsing framework (or probably as they were preprocessed by some filter class), and the framework forwards each single string it encounters in a content stream separately.
Thus, if a seemingly connected word in the content stream actually is drawn using multiple strings, you get multiple TextChunk instances for it.
There actually is some "intelligence" in the method getResultantText joining these chunks properly, adding a space where necessary and so on.
In the case of your document, "DETAIL " usually is drawn like this:
[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ
As you see there are slight text insertion point moves between 'D' and 'E', 'T' and 'A', 'I' and 'L', and 'L' and ' '. (Such mini moves usually represent kerning.) Thus, you'll get individual TextChunk instances for 'D', 'ET', 'AI', 'L', and ' '.
Admittedly, the LocationTextExtractionStrategy.locationalResult member is not very well documented; but as it is a private member, this IMHO is forgivable.
That this worked well for many documents is due to many PDF creators not applying kerning and simply drawing connected text using single string objects.
The HorizontalTextExtractionStrategy is derived from the LocationTextExtractionStrategy and mainly differs from it in the way it arranges the TextChunk instances to a single string. Thus, you'll see the same fragmentation here.
Is there any other way to extract connected texts from a pdf?
If you want "connected texts" as in "atomic string objects in the content stream", you already have them.
If you want "connected texts" as in "visually connected texts, no matter where the constituent letters are drawn in the content stream", you have to glue those TextChunk instances together like the LocationTextExtractionStrategy and HorizontalTextExtractionStrategy do in getResultantText in combination with the comparison methods in their respective TextChunkLocationDefaultImp and HorizontalTextChunkLocation implementations.

After debugging deep into the iTextSharp library I figured out that my texts are drawn with the TJ operator, as mkl also mentioned.
[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ
iText processes these texts not as a single PdfString but as an array of PdfObjects, which ends up calling renderListener.RenderText(renderInfo) for each PdfString item in it (see the ShowTextArray class and the DisplayPdfString method). In the RenderText method, however, the information about how the PDF strings within the array relate to each other is lost, and every item is added to locationalResult as an independent object.
As my goal is to extract the "argument of a single text drawing instruction", I extended the PdfContentStreamProcessor class with a new method ProcessTexts which returns a list of these atomic strings. My workaround is not very pretty, as I had to copy and paste some private fields and methods from the original source, but it works for me.
class PdfContentStreamProcessorEx : PdfContentStreamProcessor
{
    private IDictionary<int, CMapAwareDocumentFont> cachedFonts = new Dictionary<int, CMapAwareDocumentFont>();
    private ResourceDictionary resources = new ResourceDictionary();
    private CMapAwareDocumentFont font = null;

    public PdfContentStreamProcessorEx(IRenderListener renderListener) : base(renderListener)
    {
    }

    // Parses the given content stream and returns the decoded text argument of each Tj/TJ instruction.
    public List<string> ProcessTexts(byte[] contentBytes, PdfDictionary resources)
    {
        this.resources.Push(resources);
        var texts = new List<string>();
        PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(contentBytes)));
        PdfContentParser ps = new PdfContentParser(tokeniser);
        List<PdfObject> operands = new List<PdfObject>();
        while (ps.Parse(operands).Count > 0)
        {
            PdfLiteral oper = (PdfLiteral)operands[operands.Count - 1];
            if ("Tj".Equals(oper.ToString()))
            {
                texts.Add(getText((PdfString)operands[0]));
            }
            else if ("TJ".Equals(oper.ToString()))
            {
                string text = string.Empty;
                foreach (PdfObject entryObj in (PdfArray)operands[0])
                {
                    if (entryObj is PdfString)
                    {
                        text += getText((PdfString)entryObj);
                    }
                }
                texts.Add(text);
            }
            else if ("Tf".Equals(oper.ToString()))
            {
                // remember the current font so that subsequent strings can be decoded
                PdfName fontResourceName = (PdfName)operands[0];
                float size = ((PdfNumber)operands[1]).FloatValue;
                PdfDictionary fontsDictionary = resources.GetAsDict(PdfName.FONT);
                CMapAwareDocumentFont _font;
                PdfObject fontObject = fontsDictionary.Get(fontResourceName);
                if (fontObject is PdfDictionary)
                    _font = GetFont((PdfDictionary)fontObject);
                else
                    _font = GetFont((PRIndirectReference)fontObject);
                font = _font;
            }
        }
        this.resources.Pop();
        return texts;
    }

    string getText(PdfString @in)
    {
        byte[] bytes = @in.GetBytes();
        return font.Decode(bytes, 0, bytes.Length);
    }

    private CMapAwareDocumentFont GetFont(PRIndirectReference ind)
    {
        CMapAwareDocumentFont font;
        cachedFonts.TryGetValue(ind.Number, out font);
        if (font == null)
        {
            font = new CMapAwareDocumentFont(ind);
            cachedFonts[ind.Number] = font;
        }
        return font;
    }

    private CMapAwareDocumentFont GetFont(PdfDictionary fontResource)
    {
        return new CMapAwareDocumentFont(fontResource);
    }

    private class ResourceDictionary : PdfDictionary
    {
        private IList<PdfDictionary> resourcesStack = new List<PdfDictionary>();

        virtual public void Push(PdfDictionary resources)
        {
            resourcesStack.Add(resources);
        }

        virtual public void Pop()
        {
            resourcesStack.RemoveAt(resourcesStack.Count - 1);
        }

        public override PdfObject GetDirectObject(PdfName key)
        {
            for (int i = resourcesStack.Count - 1; i >= 0; i--)
            {
                PdfDictionary subResource = resourcesStack[i];
                if (subResource != null)
                {
                    PdfObject obj = subResource.GetDirectObject(key);
                    if (obj != null) return obj;
                }
            }
            return base.GetDirectObject(key); // shouldn't be necessary, but just in case we've done something crazy
        }
    }
}
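A minimal sketch of how this class might be driven, assuming the standard iTextSharp parser helpers (ContentByteUtils.GetContentBytesForPage concatenates a page's content streams; the render listener passed to the base constructor is not used by ProcessTexts):
var reader = new PdfReader("some_pdf");
var processor = new PdfContentStreamProcessorEx(new LocationTextExtractionStrategy());
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    PdfDictionary pageDic = reader.GetPageN(i);
    PdfDictionary pageResources = pageDic.GetAsDict(PdfName.RESOURCES);
    byte[] content = ContentByteUtils.GetContentBytesForPage(reader, i);
    foreach (string text in processor.ProcessTexts(content, pageResources))
    {
        // each entry corresponds to the argument of one Tj/TJ instruction
    }
}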

Related

iTextSharp Read Text From Single Layer of PDF

Currently I am using a custom LocationTextExtractionStrategy to extract text from a PDF that returns a TextRenderInfo[]. I would like to be able to determine if a TextRenderInfo object (or PDFString, child of TextRenderInfo) appears in a specific layer. I am not sure if this is possible. To get the layers in a PDF, I am using:
Dictionary<string, PdfLayer> layers;
using (var pdfReader = new PdfReader(src))
{
    var newSrc = Path.Combine(["new file location"]);
    using (var stream = new FileStream(newSrc, FileMode.Create))
    {
        PdfStamper stamper = new PdfStamper(pdfReader, stream);
        layers = stamper.GetPdfLayers();
        stamper.Close();
    }
    pdfReader.Close();
    src = newSrc;
}
To extract the text, I am using:
var textExtractor = new TextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(pdfReader, pdfPageNum,textExtractor);
List<TextRenderInfo> results = textExtractor.Results;
Is there any way that I can check whether the individual TextRenderInfo results exist within the layers obtained in the first code snippet? Any help would be much appreciated.
It is possible to get the contents from a single layer, but you'll have to jump through a few hoops to work it out. Specifically, you will have to recreate some of the logic that is provided by the PdfTextExtractor and PdfReaderContentParser.
public static String GetText(PdfReader reader, int pageNumber, int streamNumber) {
    var strategy = new LocationTextExtractionStrategy();
    var processor = new PdfContentStreamProcessor(strategy);
    var pageDic = reader.GetPageN(pageNumber);
    var resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
    // assuming you still need to extract the page bytes
    byte[] contents = GetContentBytesForPageStream(reader, pageNumber, streamNumber);
    processor.ProcessContent(contents, resourcesDic);
    return strategy.GetResultantText();
}
public static byte[] GetContentBytesForPageStream(PdfReader reader, int pageNumber, int streamNumber) {
    PdfDictionary pageDictionary = reader.GetPageN(pageNumber);
    PdfObject contentObject = pageDictionary.Get(PdfName.CONTENTS);
    if (contentObject == null)
        return new byte[0];
    byte[] contentBytes = GetContentBytesFromContentObject(contentObject, streamNumber);
    return contentBytes;
}
public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber) {
    // copy-paste logic from
    // ContentByteUtils.GetContentBytesFromContentObject(contentObject);
    // but in case PdfObject.ARRAY: only select the streamNumber you require
}
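The elided body might look roughly like the following sketch. It mirrors what ContentByteUtils.GetContentBytesFromContentObject does, except that the ARRAY case resolves only the requested stream instead of concatenating them all (assumption: streamNumber is a zero-based index into the Contents array):
public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber) {
    switch (contentObject.Type) {
        case PdfObject.INDIRECT:
            // resolve the reference and process whatever it points to
            return GetContentBytesFromContentObject(PdfReader.GetPdfObject(contentObject), streamNumber);
        case PdfObject.STREAM:
            // a single content stream; nothing to select here
            return PdfReader.GetStreamBytes((PRStream)PdfReader.GetPdfObject(contentObject));
        case PdfObject.ARRAY:
            // pick only the requested stream instead of concatenating all of them
            PdfArray contentArray = (PdfArray)contentObject;
            return GetContentBytesFromContentObject(contentArray.GetPdfObject(streamNumber), streamNumber);
        default:
            throw new InvalidOperationException("Unexpected content object type " + contentObject.Type);
    }
}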
If you're specifically looking to just use PdfTextExtractor or PdfReaderContentParser, and ask the returned TextRenderInfo for the layer it's on, then I'm not sure it will be easily possible. There are a number of problems with that:
TextRenderInfo doesn't store that information, so you'd have to subclass it (which is possible)
you'd have to rewrite the logic that creates the TextRenderInfo objects. This is possible by registering custom IContentOperator objects for all text operators (Tj, TJ, ' and ") with the PdfTextExtractor or PdfReaderContentParser
the hardest part is that you have already lost layer information in ContentByteUtils.GetContentBytesFromContentObject - so you'd need to retain that somehow, which creates its own set of problems.

Trying to replace graphics resources in a PDF - PDFBox 2.0.8

I'm trying to manipulate image resources in some PDF files; the workflow is: extract image resources -> process each -> replace old ones with the new.
Simple task really, I have working code for extracting and replacing, but when I replace, the new file size is nearly twice the original.
To replace the images, I use PDResources.put(COSName, PDXObject). Any ideas what would cause the size increase in the resulting document? It happens even if I completely omit the middle step in the workflow to process each image resource.
public static void PDFBoxReplaceImages() throws Exception {
    PDDocument document = PDDocument.load(new File("C:\\Users\\Markus\\workspace\\pdf-test\\book.pdf"));
    PDPageTree list = document.getPages();
    int counter = 0; // image counter for the extracted file names
    for (PDPage page : list) {
        PDResources pdResources = page.getResources();
        for (COSName c : pdResources.getXObjectNames()) {
            PDXObject o = pdResources.getXObject(c);
            if (o instanceof PDImageXObject) {
                counter++;
                String path = "C:\\Users\\Markus\\workspace\\pdf-test\\images\\" + counter + ".png";
                PDImageXObject newImg = PDImageXObject.createFromFile(path, document);
                pdResources.put(c, newImg);
            }
        }
    }
    document.save("C:\\Users\\Markus\\workspace\\pdf-test\\book.pdf");
}

ITextSharp crop PDF to remove white margins [duplicate]

I have a PDF which comprises some data, followed by some whitespace. I don't know how large the data is, but I'd like to trim off the whitespace following the data.
PdfReader reader = new PdfReader(PDFLOCATION);
Rectangle rect = new Rectangle(700, 2000);
Document document = new Document(rect);
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(SAVELCATION));
document.open();
int n = reader.getNumberOfPages();
PdfImportedPage page;
for (int i = 1; i <= n; i++) {
    document.newPage();
    page = writer.getImportedPage(reader, i);
    Image instance = Image.getInstance(page);
    document.add(instance);
}
document.close();
Is there a way to clip/trim the whitespace for each page in the new document?
This PDF contains vector graphics.
I'm using iTextPDF, but can switch to any Java library (mavenized, Apache license preferred).
As no actual solution has been posted, here some pointers from the accompanying itext-questions mailing list thread:
As you want to merely trim pages, this is not a case of PdfWriter + getImportedPage usage but instead of PdfStamper usage. Your main code using a PdfStamper might look like this:
PdfReader reader = new PdfReader(resourceStream);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("target/test-outputs/test-trimmed-stamper.pdf"));
// Go through all pages
int n = reader.getNumberOfPages();
for (int i = 1; i <= n; i++)
{
    Rectangle pageSize = reader.getPageSize(i);
    Rectangle rect = getOutputPageSize(pageSize, reader, i);

    PdfDictionary page = reader.getPageN(i);
    page.put(PdfName.CROPBOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
    stamper.markUsed(page);
}
stamper.close();
As you see I also added another argument to your getOutputPageSize method to-be. It is the page number. The amount of white space to trim might differ on different pages after all.
If the source document did not contain vector graphics, you could simply use the iText parser package classes. There even already is a TextMarginFinder based on them. In this case the getOutputPageSize method (with the additional page parameter) could look like this:
private Rectangle getOutputPageSize(Rectangle pageSize, PdfReader reader, int page) throws IOException
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    TextMarginFinder finder = parser.processContent(page, new TextMarginFinder());

    Rectangle result = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
    System.out.printf("Text/bitmap boundary: %f,%f to %f, %f\n", finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
    return result;
}
Using this method with your file test.pdf, the result is a page trimmed according to the text (and bitmap image) content on it.
To find the bounding box respecting vector graphics, too, you essentially have to do the same but you have to extend the parser framework used here to inform its listeners (the TextMarginFinder essentially is a listener to drawing events sent from the parser framework) about vector graphics operations, too. This is non-trivial, especially if you don't know PDF syntax by heart yet.
If your PDFs to trim are not too generic but can be forced to include some text or bitmap graphics in relevant positions, though, you could use the sample code above (probably with minor changes) anyways.
E.g. if your PDFs always start with text on top and end with text at the bottom, you could change getOutputPageSize to create the result rectangle like this:
Rectangle result = new Rectangle(pageSize.getLeft(), finder.getLly(), pageSize.getRight(), finder.getUry());
This only trims the top and bottom empty space.
Depending on your input data pool and requirements this might suffice.
Or you can use some other heuristics depending on your knowledge on the input data. If you know something about the positioning of text (e.g. the heading to always be centered and some other text to always start at the left), you can easily extend the TextMarginFinder to take advantage of this knowledge.
Recent (April 2015, iText 5.5.6-SNAPSHOT) improvements
The current development version, 5.5.6-SNAPSHOT, extends the parser package to also include vector graphics parsing. This allows for an extension of iText's original TextMarginFinder class implementing the new ExtRenderListener methods like this:
@Override
public void modifyPath(PathConstructionRenderInfo renderInfo)
{
    List<Vector> points = new ArrayList<Vector>();
    if (renderInfo.getOperation() == PathConstructionRenderInfo.RECT)
    {
        float x = renderInfo.getSegmentData().get(0);
        float y = renderInfo.getSegmentData().get(1);
        float w = renderInfo.getSegmentData().get(2);
        float h = renderInfo.getSegmentData().get(3);

        points.add(new Vector(x, y, 1));
        points.add(new Vector(x+w, y, 1));
        points.add(new Vector(x, y+h, 1));
        points.add(new Vector(x+w, y+h, 1));
    }
    else if (renderInfo.getSegmentData() != null)
    {
        for (int i = 0; i < renderInfo.getSegmentData().size()-1; i+=2)
        {
            points.add(new Vector(renderInfo.getSegmentData().get(i), renderInfo.getSegmentData().get(i+1), 1));
        }
    }

    for (Vector point: points)
    {
        point = point.cross(renderInfo.getCtm());
        Rectangle2D.Float pointRectangle = new Rectangle2D.Float(point.get(Vector.I1), point.get(Vector.I2), 0, 0);
        if (currentPathRectangle == null)
            currentPathRectangle = pointRectangle;
        else
            currentPathRectangle.add(pointRectangle);
    }
}

@Override
public Path renderPath(PathPaintingRenderInfo renderInfo)
{
    if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
    {
        if (textRectangle == null)
            textRectangle = currentPathRectangle;
        else
            textRectangle.add(currentPathRectangle);
    }
    currentPathRectangle = null;
    return null;
}

@Override
public void clipPath(int rule)
{
}
(Full source: MarginFinder.java)
Using this class to trim the white space gives a result which is pretty much what one would hope for.
Beware: The implementation above is far from optimal. It is not even correct as it includes all curve control points which is too much. Furthermore it ignores stuff like line width or wedge types. It actually merely is a proof-of-concept.
All test code is in TestTrimPdfPage.java.

replace string in PDF document (ITextSharp or PdfSharp)

We use a non-managed DLL that has a function to replace text in a PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to a managed solution (iTextSharp or PdfSharp).
I know that this question has been asked before and that the answers are "you should not do it" or "it is not easily supported by PDF".
However there exists a solution that works for us and we just need to convert it to C#.
Any ideas how I should approach it?
According to your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
That should be possible with any general purpose PDF library, it definitely is with iText(Sharp):
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        byte[] contentBytes = reader.GetPageContent(1);
        string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
        contentString = contentString.Replace(origText, replaceText);
        reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));

        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
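Called, for instance, like this (file names and search strings are just placeholders):
VerySimpleReplaceText("original.pdf", "replaced.pdf", "Moby", "Mary");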
WARNING: Just like in case of the Debenu function, for most documents this code wouldn’t have any effect or would even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
Thus, if during your move to managed solution you also change the way the source documents are created, chances are that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.
For PdfSharp users, here is a somewhat usable function I copied from my project. It uses a utility method which is consumed by other methods, hence the unused result.
It ignores whitespace created by kerning, and may therefore mess up the result (all characters in the same space) depending on the source material.
public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
    ModifyPdfContentStreams(contentPage, stream =>
    {
        if (!stream.TryUnfilter())
            return false;
        var search = string.Join("\\s*", source.Select(c => c.ToString()));
        var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
        if (!Regex.IsMatch(stringStream, search))
            return false;
        stringStream = Regex.Replace(stringStream, search, target);
        stream.Value = Encoding.Default.GetBytes(stringStream);
        stream.Zip();
        return false;
    });
}
public static void ModifyPdfContentStreams(PdfPage contentPage, Func<PdfDictionary.PdfStream, bool> Modification)
{
    for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
        if (Modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
            return;
    var resources = contentPage.Elements?.GetDictionary("/Resources");
    var xObjects = resources?.Elements.GetDictionary("/XObject");
    if (xObjects == null)
        return;
    foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
    {
        var stream = (item.Value as PdfDictionary)?.Stream;
        if (stream != null)
            if (Modification(stream))
                return;
    }
}
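A possible way to drive these two helpers, assuming PdfSharp's PdfReader.Open and PdfDocumentOpenMode.Modify (file names are placeholders):
var document = PdfSharp.Pdf.IO.PdfReader.Open("original.pdf", PdfDocumentOpenMode.Modify);
foreach (PdfPage page in document.Pages)
    ReplaceTextInPdfPage(page, "Moby", "Mary");
document.Save("replaced.pdf");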

Text Extraction, Not Image Extraction

Please help me understand if my solution is correct.
I'm trying to extract text from a PDF file with a LocationTextExtractionStrategy parser. I'm getting exceptions, seemingly because the ParseContent method tries to parse inline images. The code is simple and looks similar to this:
RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
I realize the images are in the content stream, but I have a PDF file failing to extract text because of inline images. It throws an UnsupportedPdfException of "The filter /DCTDECODE is not supported" and then finally fails with an InlineImageParseException of "Could not find image data or EI", when all I really care about is the text. The BI/EI exists in my file, so I assume this failure is because of the /DCTDECODE exception. But again, I don't care about images; I'm looking for text.
My current solution for this is to add a filterHandler in the InlineImageUtils class that assigns the Filter_DoNothing() filter to the DCTDECODE filterHandler dictionary. This way I don't get exceptions when I have InlineImages with DCTDECODE. Like this:
private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
    try {
        IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers = new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
        handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
        PdfReader.DecodeBytes(samples, imageDictionary, handlers);
        return true;
    } catch (IOException e) {
        return false;
    }
}

public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
    public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
    {
        return b;
    }
}
My problem with this "fix" is that I had to change the iTextSharp library. I'd rather not do that so I can try to stay compatible with future versions.
Here's the PDF in question:
https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5