How to set define different styles for the same paragraph - apache

I'm trying to convert html text to generate a word table. It works pretty well, and the created word file is correct, except the character styles.
This is my first try with Apache POI.
So far, I was able to detect new line (<br>) tags from text paragraph (see code below). But I'd like to also check a few other tags such as <b>, <li>, <font> and set the right run values for each part.
For example :
This is my text <i> which now is in italic<b> but also in bold</b> depending on its importance</i>
I gess I should parse the text, and apply different runs for each part, but I don't know how to do.
private static XWPFParagraph getTableParagraph(XWPFTableCell cell, String text)
{
int fontsize= 11;
XWPFParagraph paragraph = cell.addParagraph();
cell.removeParagraph(0);
paragraph.setSpacingAfterLines(0);
paragraph.setSpacingAfter(0);
XWPFRun myRun1 = paragraph.createRun();
if (text==null) text="";
else
{
while (true)
{
int x = text.indexOf("<br>");
if (x <0) break;
String work = text.substring(0,x );
text= text.substring(x+4);
myRun1.setText(work);
myRun1.addBreak();
}
}
myRun1.setText(text);
myRun1.setFontSize(fontsize);
return paragraph;
}

While converting HTML text one never should go on the HTML using string methods only. XML as well as HTML are markup languages. Their content is markup and not only plain text. The markup needs to be traversed to get all the single nodes together with the meanings out of it. This traversing process never is trivial and so special libraries are there for. Deep inside those libraries also needs using string methods but those are wrapped into useful methods for traversing the markup.
For traversing HTML jsoup may be used for example. Especially NodeTraversor using NodeVisitor is useful for traversing HTML.
My example creates a ParagraphNodeVisitor which implements NodeVisitor. This interface requests method public void head(Node node, int depth) which is called every time the NodeTraversor is on head of a node and public void tail(Node node, int depth) which is called every time the NodeTraversor is on tail of a node. In those methods the process for handling the single nodes can be implemented. In our case main part of the process is whether we need a new XWPFRun and what settings this run needs.
Example:
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeVisitor;
import org.jsoup.select.NodeTraversor;
public class HTMLtoDOCX {
private XWPFDocument document;
public HTMLtoDOCX(String html, String docxPath) throws Exception {
this.document = new XWPFDocument();
XWPFParagraph paragraph = null;
Document htmlDocument = Jsoup.parse(html);
Elements htmlParagraphs = htmlDocument.select("p");
for(Element htmlParagraph : htmlParagraphs) {
System.out.println(htmlParagraph);
paragraph = document.createParagraph();
createParagraphFromHTML(paragraph, htmlParagraph);
}
FileOutputStream out = new FileOutputStream(docxPath);
document.write(out);
out.close();
document.close();
}
void createParagraphFromHTML(XWPFParagraph paragraph, Element htmlParagraph) {
ParagraphNodeVisitor nodeVisitor = new ParagraphNodeVisitor(paragraph);
NodeTraversor.traverse(nodeVisitor, htmlParagraph);
}
private class ParagraphNodeVisitor implements NodeVisitor {
String nodeName;
boolean needNewRun;
boolean isItalic;
boolean isBold;
boolean isUnderlined;
int fontSize;
String fontColor;
XWPFParagraph paragraph;
XWPFRun run;
ParagraphNodeVisitor(XWPFParagraph paragraph) {
this.paragraph = paragraph;
this.run = paragraph.createRun();
this.nodeName = "";
this.isItalic = false;
this.isBold = false;
this.isUnderlined = false;
this.fontSize = 11;
this.fontColor = "000000";
}
#Override
public void head(Node node, int depth) {
nodeName = node.nodeName();
System.out.println("Start "+nodeName+": " + node);
if ("#text".equals(nodeName)) {
run.setText(((TextNode)node).text​());
} else if ("i".equals(nodeName)) {
isItalic = true;
} else if ("b".equals(nodeName)) {
isBold = true;
} else if ("u".equals(nodeName)) {
isUnderlined = true;
} else if ("br".equals(nodeName)) {
run.addBreak();
} else if ("font".equals(nodeName)) {
fontColor = (!"".equals(node.attr("color")))?node.attr("color").substring(1):"000000";
fontSize = (!"".equals(node.attr("size")))?Integer.parseInt(node.attr("size")):11;
}
run.setItalic(isItalic);
run.setBold(isBold);
if (isUnderlined) run.setUnderline(UnderlinePatterns.SINGLE); else run.setUnderline(UnderlinePatterns.NONE);
run.setColor(fontColor); run.setFontSize(fontSize);
}
#Override
public void tail(Node node, int depth) {
nodeName = node.nodeName();
System.out.println("End "+nodeName);
if ("#text".equals(nodeName)) {
run = paragraph.createRun(); //after setting the text in the run a new run is needed
} else if ("i".equals(nodeName)) {
isItalic = false;
} else if ("b".equals(nodeName)) {
isBold = false;
} else if ("u".equals(nodeName)) {
isUnderlined = false;
} else if ("br".equals(nodeName)) {
run = paragraph.createRun(); //after setting a break a new run is needed
} else if ("font".equals(nodeName)) {
fontColor = "000000";
fontSize = 11;
}
run.setItalic(isItalic);
run.setBold(isBold);
if (isUnderlined) run.setUnderline(UnderlinePatterns.SINGLE); else run.setUnderline(UnderlinePatterns.NONE);
run.setColor(fontColor); run.setFontSize(fontSize);
}
}
public static void main(String[] args) throws Exception {
String html =
"<p>Text without tags. <b> Then bold <br/> having break.</b> Then without tags again.</p>"
+"<p><font size='32' color='#0000FF'><b>First paragraph.</font></b><br/>Just like a heading</p>"
+"<p>This is my text <i>which now is in italic <b>but also in bold</b> depending on its <u>importance</u></i>.<br/>Now a <b><i><u>new</u></i></b> line starts <i>within <b>the same</b> paragraph</i>.</p>"
+"<p><b>Last <u>paragraph <i>comes</u> here</b> finally</i>.</p>"
+"<p>But yet <u><i><b>another</i></u></b> paragraph having <i><font size='22' color='#FF0000'>special <u>font</u> settings</font></i>. Now default font again.</p>"
;
HTMLtoDOCX htmlToDOCX = new HTMLtoDOCX(html, "./CreateWordParagraphFromHTML.docx");
}
}
Result:
Disclaimer: This is a working draft showing the principle. Neither it is fully ready nor it is code ready for use in productive environments.

Related

How to get similar behaviour as the Java browsing perspective with two commonNavigator views?

I'd like to reproduce the behavior of the Java browsing perspective as so : the user select a file (selectedFile) from a Custom Common Navigator View (VariabilityNavigator) and it displays only the related files to that selection in another Custom Common Navigator View (ConfigFileNavigator). It refreshes automatically as the selection goes.
I use Common navigator because I want to display the package of the related files and not the other ones or the empty ones. That I coded fine.
Could you please walk me thru the steps to do that ? Here is how far I got..
I have successfully coded the first view VariabilityNavigator.
I thought I would use a filter ConfigurationFilesFilter to get the selectedFile and use it to filter the related files.
I can get the selection from the VariabilityNavigator within the select(...) of ConfigurationFilesFilter code below
I then use it to filter the files I want to display using isRelatedToSelection. I haven't write that part yet, but it's not a problem for testing.
This method don't work for several reasons : it doesn't refresh and I have to click on the ConfigFileNavigator to get the result. What it should do : when I click on the first view and it should automatically display the result in the second view. Thanks for your help.
public class ConfigurationFilesFilter extends ViewerFilter {
String selectedFile;
public ConfigurationFilesFilter() {
}
#Override
public boolean select(Viewer viewer, Object parent, Object element) {
selectedFile = getVariabilityNavigatorSelection();
if (isChildElement(element)) {
return handleChild(element);
} else {
// manipulate tree viewer based on select decision for the sub elements
StructuredViewer sviewer = (StructuredViewer) viewer;
ITreeContentProvider provider = (ITreeContentProvider) sviewer.getContentProvider();
for (Object child : provider.getChildren(element)) {
if (select(viewer, element, child)) {
return true;
}
}
return false;
}
}
/**
*
* #param element
* #return true to be displayed
*/
private boolean handleChild(Object element) {
return true;
}
/**
* Check if an element is a java file and if it is related to the selection
*
* #param element
* #return boolean
*/
private boolean isChildElement(Object element) {
if (element instanceof IFile) {
if (((IFile) element).getFileExtension().equals("java")) {
return (isRelatedToSelection((IFile) element));
}
return false;
}
return false;
}
public String getVariabilityNavigatorSelection() {
IWorkbenchPage page = PlatformUI.getWorkbench().getActiveWorkbenchWindow().getActivePage();
IViewPart viewPart = page.findView(VariabilityNavigator.ID);
String VariabilityNavigatorSelection = "" ;
if (viewPart == null) {
VariabilityNavigatorSelection = "open Variability view. \r\nTo do so, open the menu as followed:\r\nWindow > Show view > Other... > SPL > Variability Navigator";
} else {
ISelectionProvider selProvider = viewPart.getSite().getSelectionProvider();
IStructuredSelection sel = (IStructuredSelection) selProvider.getSelection();
if (sel.isEmpty()) {
VariabilityNavigatorSelection = "selection vide";
}
Iterator<?> i = sel.iterator();
while (i.hasNext()) {
final Object o = i.next();
if (o instanceof IFile) {
VariabilityNavigatorSelection = ((IFile) o).getName();
} else
VariabilityNavigatorSelection = "Please select a file";
}
}
if (!VariabilityNavigatorSelection.isEmpty())
printTrace("selectPrint = " + VariabilityNavigatorSelection);
return VariabilityNavigatorSelection;
}

How to replace / remove text from a PDF file

How would I go about replacing / removing text from a PDF file?
I have a PDF file that I obtained somewhere, and I want to be able to replace some text within it.
Or, I have a PDF file that I want to obscure (redact) some of the text within it so that it's no longer visible [and so that it looks cool, like the CIA files].
Or, I have a PDF that contains global Javascript that I want to stop from interrupting my use of the PDF.
This is possible in a limited fashion with the use of iText / iTextSharp.
It will only work with Tj/TJ opcodes (i.e. standard text, not text embedded in images, or drawn with shapes).
You need to override the default PdfContentStreamProcessor to act on the page content streams, as presented by Mkl here Removing Watermark from PDF iTextSharp. Inherit from this class, and in your new class look for the Tj/TJ opcodes, the operand(s) will generally be the text element(s) (for a TJ this may not be straightforward text, and may require further parsing of all the operands).
A pretty basic example of some of the flexibility around iTextSharp is available from this github repository https://github.com/bevanweiss/PdfEditor (code excerpts below also)
NOTE: This utilises the AGPL version of iTextSharp (and is hence also AGPL), so if you will be distributing executables derived from this code or allowing others to interact with those executables in any way then you must also provide your modified source code. There is also no warranty, implied or expressed, related to this code. Use at your own peril.
PdfContentStreamEditor
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class PdfContentStreamEditor : PdfContentStreamProcessor
{
/**
* This method edits the immediate contents of a page, i.e. its content stream.
* It explicitly does not descent into form xobjects, patterns, or annotations.
*/
public void EditPage(PdfStamper pdfStamper, int pageNum)
{
var pdfReader = pdfStamper.Reader;
var page = pdfReader.GetPageN(pageNum);
var pageContentInput = ContentByteUtils.GetContentBytesForPage(pdfReader, pageNum);
page.Remove(PdfName.CONTENTS);
EditContent(pageContentInput, page.GetAsDict(PdfName.RESOURCES), pdfStamper.GetUnderContent(pageNum));
}
/**
* This method processes the content bytes and outputs to the given canvas.
* It explicitly does not descent into form xobjects, patterns, or annotations.
*/
public virtual void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
{
this.Canvas = canvas;
ProcessContent(contentBytes, resources);
this.Canvas = null;
}
/**
* This method writes content stream operations to the target canvas. The default
* implementation writes them as they come, so it essentially generates identical
* copies of the original instructions the {#link ContentOperatorWrapper} instances
* forward to it.
*
* Override this method to achieve some fancy editing effect.
*/
protected virtual void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
{
var index = 0;
foreach (var pdfObject in operands)
{
pdfObject.ToPdf(null, Canvas.InternalBuffer);
Canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
}
}
//
// constructor giving the parent a dummy listener to talk to
//
public PdfContentStreamEditor() : base(new DummyRenderListener())
{
}
//
// constructor giving the parent a dummy listener to talk to
//
public PdfContentStreamEditor(IRenderListener renderListener) : base(renderListener)
{
}
//
// Overrides of PdfContentStreamProcessor methods
//
public override IContentOperator RegisterContentOperator(string operatorString, IContentOperator newOperator)
{
var wrapper = new ContentOperatorWrapper();
wrapper.SetOriginalOperator(newOperator);
var formerOperator = base.RegisterContentOperator(operatorString, wrapper);
return (formerOperator is ContentOperatorWrapper operatorWrapper ? operatorWrapper.GetOriginalOperator() : formerOperator);
}
public override void ProcessContent(byte[] contentBytes, PdfDictionary resources)
{
this.Resources = resources;
base.ProcessContent(contentBytes, resources);
this.Resources = null;
}
//
// members holding the output canvas and the resources
//
protected PdfContentByte Canvas = null;
protected PdfDictionary Resources = null;
//
// A content operator class to wrap all content operators to forward the invocation to the editor
//
class ContentOperatorWrapper : IContentOperator
{
public IContentOperator GetOriginalOperator()
{
return _originalOperator;
}
public void SetOriginalOperator(IContentOperator op)
{
this._originalOperator = op;
}
public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
if (_originalOperator != null && !"Do".Equals(oper.ToString()))
{
_originalOperator.Invoke(processor, oper, operands);
}
((PdfContentStreamEditor)processor).Write(processor, oper, operands);
}
private IContentOperator _originalOperator = null;
}
//
// A dummy render listener to give to the underlying content stream processor to feed events to
//
class DummyRenderListener : IRenderListener
{
public void BeginTextBlock() { }
public void RenderText(TextRenderInfo renderInfo) { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
}
TextReplaceStreamEditor
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class TextReplaceStreamEditor : PdfContentStreamEditor
{
public TextReplaceStreamEditor(string MatchPattern, string ReplacePattern)
{
_matchPattern = MatchPattern;
_replacePattern = ReplacePattern;
}
private string _matchPattern;
private string _replacePattern;
protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
var operatorString = oper.ToString();
if ("Tj".Equals(operatorString) || "TJ".Equals(operatorString))
{
for(var i = 0; i < operands.Count; i++)
{
if(!operands[i].IsString())
continue;
var text = operands[i].ToString();
if(Regex.IsMatch(text, _matchPattern))
{
operands[i] = new PdfString(Regex.Replace(text, _matchPattern, _replacePattern));
}
}
}
base.Write(processor, oper, operands);
}
}
}
TextRedactStreamEditor
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class TextRedactStreamEditor : PdfContentStreamEditor
{
public TextRedactStreamEditor(string MatchPattern) : base(new RedactRenderListener(MatchPattern))
{
_matchPattern = MatchPattern;
}
private string _matchPattern;
protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
base.Write(processor, oper, operands);
}
public override void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
{
((RedactRenderListener)base.RenderListener).SetCanvas(canvas);
base.EditContent(contentBytes, resources, canvas);
}
}
//
// A pretty simple render listener, all we care about it text stuff.
// We listen out for text blocks, look for our text, and then put a
// black box over it.. text 'redacted'
//
class RedactRenderListener : IRenderListener
{
private PdfContentByte _canvas;
private string _matchPattern;
public RedactRenderListener(string MatchPattern)
{
_matchPattern = MatchPattern;
}
public RedactRenderListener(PdfContentByte Canvas, string MatchPattern)
{
_canvas = Canvas;
_matchPattern = MatchPattern;
}
public void SetCanvas(PdfContentByte Canvas)
{
_canvas = Canvas;
}
public void BeginTextBlock() { }
public void RenderText(TextRenderInfo renderInfo)
{
var text = renderInfo.GetText();
var match = Regex.Match(text, _matchPattern);
if(match.Success)
{
var p1 = renderInfo.GetCharacterRenderInfos()[match.Index].GetAscentLine().GetStartPoint();
var p2 = renderInfo.GetCharacterRenderInfos()[match.Index+match.Length].GetAscentLine().GetEndPoint();
var p3 = renderInfo.GetCharacterRenderInfos()[match.Index+match.Length].GetDescentLine().GetEndPoint();
var p4 = renderInfo.GetCharacterRenderInfos()[match.Index].GetDescentLine().GetStartPoint();
_canvas.SaveState();
_canvas.SetColorStroke(BaseColor.BLACK);
_canvas.SetColorFill(BaseColor.BLACK);
_canvas.MoveTo(p1[Vector.I1], p1[Vector.I2]);
_canvas.LineTo(p2[Vector.I1], p2[Vector.I2]);
_canvas.LineTo(p3[Vector.I1], p3[Vector.I2]);
_canvas.LineTo(p4[Vector.I1], p4[Vector.I2]);
_canvas.ClosePathFillStroke();
_canvas.RestoreState();
}
}
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
Using them with iTextSharp
var reader = new PdfReader("SRC FILE PATH GOES HERE");
var dstFile = File.Open("DST FILE PATH GOES HERE", FileMode.Create);
pdfStamper = new PdfStamper(reader, output, reader.PdfVersion, false);
// We don't need to auto-rotate, as the PdfContentStreamEditor will already deal with pre-rotated space..
// if we enable this we will inadvertently rotate the content.
pdfStamper.RotateContents = false;
// This is for the Text Replace
var replaceTextProcessor = new TextReplaceStreamEditor(
"TEXT TO REPLACE HERE",
"TEXT TO SUBSTITUTE IN HERE");
for(int i=1; i <= reader.NumberOfPages; i++)
replaceTextProcessor.EditPage(pdfStamper, i);
// This is for the Text Redact
var redactTextProcessor = new TextRedactStreamEditor(
"TEXT TO REDACT HERE");
for(int i=1; i <= reader.NumberOfPages; i++)
redactTextProcessor.EditPage(pdfStamper, i);
// Since our redacting just puts a box over the top, we should secure the document a bit... just to prevent people copying/pasting the text behind the box.. we also prevent text to speech processing of the file, otherwise the 'hidden' text will be spoken
pdfStamper.Writer.SetEncryption(null,
Encoding.UTF8.GetBytes("ownerPassword"),
PdfWriter.AllowDegradedPrinting | PdfWriter.AllowPrinting,
PdfWriter.ENCRYPTION_AES_256);
// hey, lets get rid of Javascript too, because it's annoying
pdfStamper.Javascript = "";
// and then finally we close our files (saving it in the process)
pdfStamper.Close();
reader.Close();
You can use GroupDocs.Redaction (available for .NET) for replacing or removing the text from PDF documents. You can perform the exact phrase, case-sensitive and regular expression redaction (removal) of the text. The following code snippet replaces the word "candy" with "[redacted]" in the loaded PDF document.
C#:
using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
doc.RedactWith(new ExactPhraseRedaction("candy", new ReplacementOptions("[redacted]")));
// Save the document to "*_Redacted.*" file.
doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false });
}
Disclosure: I work as Developer Evangelist at GroupDocs.

How to remove a line with a particular content in docx4j

I want to delete a particular line in the docx if it has a particular word, say "killer".
How i can write a program using docx4j?
If i replace it with empty data, the line will be still there. I want to remove the whole line.
I tried something like this,
private void replacePlaceholders(WordprocessingMLPackage targetDocument,
String nameOfTheInvitedGuest) throws JAXBException {
List<Object> texts = targetDocument.getMainDocumentPart()
.getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
System.out.println(texts.size());
Iterator<Object> itr = texts.iterator();
while (itr.hasNext()) {
Object obj = itr.next();
Text text = (Text) ((JAXBElement) obj).getValue();
// System.out.println(text.getValue());
if (text.getValue().contains("Hulk Hogan")) {
itr.remove();
}
else {
String textValue = replacePlaceholderOfInvitedGuestWithGivenName(
nameOfTheInvitedGuest, text.getValue());
for (Object key : templateProperties.keySet()) {
textValue = textValue.replaceAll("\\$\\{" + key + "\\}",
(String) templateProperties.get(key));
}
text.setValue(textValue);
}
}
System.out.println(texts.size());
}
But its still showing in the docx file.
A Text element in a docx file has parent elements. The text will reside within a Run which in turn will sit within a block element like a paragraph (P node) or a table cell. If you're looking to remove a particular block element based on its textual content, once you've located the relevant text elements, you need to move up the parent elements, and remove them too -- for example, if the ultimate parent is a paragraph node, remove that.
If say, a paragraph displays as 3 lines in Word and you are trying to remove the 2nd line in that paragraph, then you have a different and more challenging problem.
maybe this will help futur people :
if(((org.docx4j.wml.Text) o2).getValue().contains("WhatYouWant")) {
// if your text contains "WhatYouWant" then...
Object o4 =((org.docx4j.wml.Text)o2).getParent();
//gets R
Object o5 = ((org.docx4j.wml.R) o4).getParent();
// gets P
Object o6 = ((org.docx4j.wml.P) o5).getParent();
// gets SdtElement
((List<List<Object>>) o6).remove(o5);
// now you remove your P (paragraph)
}
I had a content control (SdtElement) but I needed to put it in List < List < Object > > don't really know why but.... You might have something else so check in your document.xml before copy/pasting this.
This is for others who have a hard time, like I did to understand docx4j
You could use Apache POI to remove Text from docx file as shown below.
public static void removeTextFromDocx(FileInputStream inpudocxfile, String stringToBeReplaced,
String stringToBeReplacedWith, FileOutputStream outputdocxfile) {
XWPFDocument document = null;
try {
//loading docx file
document = new XWPFDocument(inpudocxfile);
for (XWPFParagraph paragraph : document.getParagraphs()) {
List<XWPFRun> runs = paragraph.getRuns();
for (XWPFRun run : runs) {
//reading an entire paragraph. So size of list is 1 and index of first element is 0
String text = run.getText(0);
if (text != null) {
if (text.contains(stringToBeReplaced)) {
text = text.replace(stringToBeReplaced, stringToBeReplacedWith);
text = text.trim();
run.setText(text, 0);
}
}
}
}
for (XWPFTable table : document.getTables()) {
for (XWPFTableRow row : table.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph paragraph : cell.getParagraphs()) {
for (XWPFRun run : paragraph.getRuns()) {
String text = run.getText(0);
if (text != null) {
if (text.contains(stringToBeReplaced)) {
text = text.replace(stringToBeReplaced, stringToBeReplacedWith);
text = text.trim();
run.setText(text, 0);
}
}
}
}
}
}
}
document.write(outputdocxfile);
} catch (IOException e) {
LOGGER.error("Could not create outputdocxFile --> IOEXception" + e);
}
}

Text extraction from table cells

I have a pdf. The pdf contains a table. The table contains many cells (>100). I know the exact position (x,y) and dimension (w,h) of every cell of the table.
I need to extract text from cells using itextsharp. Using PdfReaderContentParser + FilteredTextRenderListener (using a code like this http://itextpdf.com/examples/iia.php?id=279 ) I can extract text but I need to run the whole procedure for each cell. My pdf have many cells and the program needs too much time to run. Is there a way to extract text from a list of "rectangle"? I need to know the text of each rectangle. I'm looking for something like PDFTextStripperByArea by PdfBox (you can define as many regions as you need and the get text using .getTextForRegion("region-name") ).
This option is not immediately included in the iTextSharp distribution but it is easy to realize. In the following I use the iText (Java) class, interface, and method names because I am more at home with Java. They should easily be translatable into iTextSharp (C#) names.
If you use the LocationTextExtractionStrategy, you can can use its a posteriori TextChunkFilter mechanism instead of the a priori FilteredRenderListener mechanism used in the sample you linked to. This mechanism has been introduced in version 5.3.3.
For this you first parse the whole page content using the LocationTextExtractionStrategy without any FilteredRenderListener filtering applied. This makes the strategy object collect TextChunk objects for all PDF text objects on the page containing the associated base line segment.
Then you call the strategy's getResultantText overload with a TextChunkFilter argument (instead of the regular no-argument overload):
public String getResultantText(TextChunkFilter chunkFilter)
You call it with a different TextChunkFilter instance for each table cell. You have to implement this filter interface which is not too difficult as it only defines one method:
public static interface TextChunkFilter
{
/**
* #param textChunk the chunk to check
* #return true if the chunk should be allowed
*/
public boolean accept(TextChunk textChunk);
}
So the accept method of the filter for a given cell must test whether the text chunk in question is inside your cell.
(Instead of separate instances for each cell you can of course also create one instance whose parameters, i.e. cell coordinates, can be changed between getResultantText calls.)
PS: As mentioned by the OP, this TextChunkFilter has not yet been ported to iTextSharp. It should not be hard to do so, though, only one small interface and one method to add to the strategy.
PPS: In a comment sschuberth asked
Do you then still call PdfTextExtractor.getTextFromPage() when using getResultantText(), or does it somehow replace that call? If so, how to you then specify the page to extract to?
Actually PdfTextExtractor.getTextFromPage() internally already uses the no-argument getResultantText() overload:
public static String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText();
}
To make use of a TextChunkFilter you could simply build a similar convenience method, e.g.
public static String getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, TextChunkFilter chunkFilter) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText(chunkFilter);
}
In the context at hand, though, in which we want to parse the page content only once and apply multiple filters, one for each cell, we might generalize this to:
public static List<String> getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, Iterable<TextChunkFilter> chunkFilters) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(pageNumber, strategy, additionalContentOperators)
List<String> result = new ArrayList<>();
for (TextChunkFilter chunkFilter : chunkFilters)
{
result.add(strategy).getResultantText(chunkFilter);
}
return result;
}
(You can make this look fancier by using Java 8 collection streaming instead of the old'fashioned for loop.)
Here's my take on how to extract text from a table-like structure in a PDF using itextsharp. It returns a collection of rows and each row contains a collection of interpreted columns. This may work for you on the premise that there is a gap between one column and the next which is greater than the average width of a single character. I also added an option to check for wrapped text within a virtual column. Your mileage may vary.
using (PdfReader pdfReader = new PdfReader(stream))
{
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
TableExtractionStrategy tableExtractionStrategy = new TableExtractionStrategy();
string pageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, tableExtractionStrategy);
var table = tableExtractionStrategy.GetTable();
}
}
public class TableExtractionStrategy : LocationTextExtractionStrategy
{
public float NextCharacterThreshold { get; set; } = 1;
public int NextLineLookAheadDepth { get; set; } = 500;
public bool AccomodateWordWrapping { get; set; } = true;
private List<TableTextChunk> Chunks { get; set; } = new List<TableTextChunk>();
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
string text = renderInfo.GetText();
Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
Rectangle rectangle = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Chunks.Add(new TableTextChunk(rectangle, text));
}
public List<List<string>> GetTable()
{
List<List<string>> lines = new List<List<string>>();
List<string> currentLine = new List<string>();
float? previousBottom = null;
float? previousRight = null;
StringBuilder currentString = new StringBuilder();
// iterate through all chunks and evaluate
for (int i = 0; i < Chunks.Count; i++)
{
TableTextChunk chunk = Chunks[i];
// determine if we are processing the same row based on defined space between subsequent chunks
if (previousBottom.HasValue && previousBottom == chunk.Rectangle.Bottom)
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
else
{
// if we are processing a new line let's check to see if this could be word wrapping behavior
bool isNewLine = true;
if (AccomodateWordWrapping)
{
int readAheadDepth = Math.Min(i + NextLineLookAheadDepth, Chunks.Count);
if (previousBottom.HasValue)
for (int j = i; j < readAheadDepth; j++)
{
if (previousBottom == Chunks[j].Rectangle.Bottom)
{
isNewLine = false;
break;
}
}
}
// if the text was not word wrapped let's treat this as a new table row
if (isNewLine)
{
if (currentString.Length > 0)
currentLine.Add(currentString.ToString());
currentString.Clear();
previousBottom = chunk.Rectangle.Bottom;
previousRight = chunk.Rectangle.Right;
currentString.Append(chunk.Text);
if (currentLine.Count > 0)
lines.Add(currentLine);
currentLine = new List<string>();
}
else
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
}
}
return lines;
}
private struct TableTextChunk
{
public Rectangle Rectangle;
public string Text;
public TableTextChunk(Rectangle rect, string text)
{
Rectangle = rect;
Text = text;
}
public override string ToString()
{
return Text + " (" + Rectangle.Left + ", " + Rectangle.Bottom + ")";
}
}
}

Telerik Mail Merge - Friendly Names (using RadRichTextBox)

I think i'm missing something obvious...
I'm using the Telerik Rad controls for WPF but i assume that the Rich text box uses some similar implementation for the mail merge functionality.
I want to have some friendly names on my mail merge fields. (namely spaces in the field names)
So i have a class for instance
Public Class someclass
{
<DisplayName("This is the complex description of the field")>
Public property thisfieldnamehasacomplexdescription as string
Public property anothercomplexfield as string
}
This is the only way i know to get "Friendly" names in the dropdown that is the mail merge.
So the two fields turn up okay as :
"This is the complex description of the field"
"anothercomplexfield"
but only anothercomplexfield actually populates with data when you do the merge.
Am i going to have to template the raddropdownbutton that holds the mail merge fields?
Is there an example of this somewhere?
Also a sub question. How do i add a scroll bar on these things?
(also i know this board is not a TELERIK specific board (duh!) but this might be useful to someone in the future. So i'll copy the answer i get from Telerik into here!
http://www.telerik.com/community/forums/wpf/richtextbox/558428-radrichtextbox-mailmerge---using-displayname-to-create-a-friendly-name-with-spaces.aspx )
This is what telerik gave me:
With the default MergeFields, it is not possible to change the display name fragment of the field in order to achieve a more friendly look. This should be possible if you implement a custom MergeField by deriving from the MergeField class. Here is a sample implementation that shows how this can be done:
public class CustomMergeField : MergeField
{
private const string CustomFieldName = "CustomField";
static CustomMergeField()
{
CodeBasedFieldFactory.RegisterFieldType(CustomMergeField.CustomFieldName, () => { return new CustomMergeField(); });
}
public override string FieldTypeName
{
get
{
return CustomMergeField.CustomFieldName;
}
}
public override Field CreateInstance()
{
return new CustomMergeField();
}
protected override DocumentFragment GetDisplayNameFragment()
{
return base.CreateFragmentFromText(string.Format(Field.DisplayNameFragmentFormat, this.GetFriendlyFieldName(this.PropertyPath)));
}
private string GetFriendlyFieldName(string fieldName)
{
int lettersInEnglishAlphabet = 26;
List<char> separators = new List<char>(lettersInEnglishAlphabet);
for (int i = 0; i < lettersInEnglishAlphabet; i++)
{
separators.Add((char)('A' + i));
}
StringBuilder newFieldName = new StringBuilder();
int previousIndex = 0;
for (int i = 1; i < fieldName.Length; i++)
{
if (separators.Contains(fieldName[i]))
{
if (previousIndex > 0)
{
newFieldName.Append(" ");
}
newFieldName.Append(fieldName.Substring(previousIndex, i - previousIndex));
previousIndex = i;
}
}
newFieldName.Append(" " + fieldName.Substring(previousIndex));
return newFieldName.ToString();
}
}
Note that the fragment that is shown when the DisplayMode is Code cannot be changed.
As for your other question, you can change the content of the dropdown button to show the friendly name of the fields and to include a scrollbar in the following way:
1. First, remove the binding of the button to the InsertMergeFieldEmptyCommand from XAML and give it a name (e.g. insertMergeField).
2. Next, add the following code in code-behind:
AddMergeFieldsInDropDownContent(this.insertMergeFieldButton);
private void AddMergeFieldsInDropDownContent(RadRibbonDropDownButton radRibbonDropDownButton)
{
Grid grid = new Grid();
grid.RowDefinitions.Add(new RowDefinition() { Height = new GridLength(100, GridUnitType.Pixel) });
ScrollViewer scrollViewer = new ScrollViewer();
scrollViewer.VerticalScrollBarVisibility = ScrollBarVisibility.Auto;
StackPanel stackPanel = new StackPanel();
foreach (string fieldName in this.editor.Document.MailMergeDataSource.GetColumnNames())
{
RadRibbonButton fieldButton = new RadRibbonButton()
{
Text = this.GetFriendlyFieldName(fieldName),
Size = ButtonSize.Medium,
HorizontalAlignment = HorizontalAlignment.Stretch,
HorizontalContentAlignment = HorizontalAlignment.Left
};
fieldButton.Command = this.editor.Commands.InsertFieldCommand;
fieldButton.CommandParameter = new MergeField() { PropertyPath = fieldName };
//or
//fieldButton.CommandParameter = new CustomMergeField() { PropertyPath = fieldName };
stackPanel.Children.Add(fieldButton);
}
stackPanel.HorizontalAlignment = System.Windows.HorizontalAlignment.Stretch;
scrollViewer.Content = stackPanel;
grid.Children.Add(scrollViewer);
radRibbonDropDownButton.DropDownContent = grid;
}
You can, of course optimize the code of the GetFriendlyName method and add it in a way that will be available by both classes.