How to delete a paragraph using XWPF - Apache POI

I am trying to delete a paragraph from a .docx document that I generated using Apache POI XWPF. With a .doc Word document I can do it easily using HWPF, as below:
for (String paraCount : plcHoldrPargrafDletdLst) {
Paragraph ph = doc.getRange().getParagraph(Integer.parseInt(paraCount));
System.out.println("Deleted Paragraph Start & End: " + ph.getStartOffset() +" & " + ph.getEndOffset());
System.out.println("Deleted Paragraph Test: " + ph.text());
ph.delete();
}
I tried to do the same in XWPF with
doc.removeBodyElement(Integer.parseInt(paraCount));
but unfortunately it does not give the result I want: the paragraph is still present in the resulting document.
Any suggestions on how to accomplish similar functionality in XWPF?

OK, this question is a bit old and an answer might not be needed anymore, but I just found a different solution from the one suggested.
I hope the following code helps somebody with the same issue:
...
FileInputStream fis = new FileInputStream(fileName);
XWPFDocument doc = new XWPFDocument(fis);
fis.close();
// Find a paragraph with todelete text inside
XWPFParagraph toDelete = doc.getParagraphs().stream()
.filter(p -> StringUtils.equalsIgnoreCase("todelete", p.getParagraphText()))
.findFirst().orElse(null);
if (toDelete != null) {
doc.removeBodyElement(doc.getPosOfParagraph(toDelete));
OutputStream fos = new FileOutputStream(fileName);
doc.write(fos);
fos.close();
}

It seems you cannot remove paragraphs from a .docx file directly.
What you should be able to do is remove the content of the paragraphs, the so-called runs. You could try this:
List<XWPFParagraph> paragraphs = doc.getParagraphs();
for (XWPFParagraph paragraph : paragraphs)
{
// Iterate backwards: removing by index shifts the remaining runs down,
// so a forward loop would skip every other run.
for (int i = paragraph.getRuns().size() - 1; i >= 0; i--)
{
paragraph.removeRun(i);
}
}
You can also specify which run of which paragraph should be removed, e.g.
// getRuns() returns a read-only view in current POI versions, so remove through the paragraph
paragraphs.get(23).removeRun(17);
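Removing by index while the list shrinks is easy to get wrong, so here is a minimal JDK-only sketch of the difference between a forward and a backward loop. A plain ArrayList of strings stands in for a paragraph's runs; the class and method names are made up for illustration and are not POI API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: strings play the role of runs in a paragraph.
final class IndexedRemoval {

    // Forward loop: after remove(i) the next element slides into slot i,
    // then i++ steps over it, so every other element survives.
    static List<String> removeForward(List<String> runs) {
        List<String> copy = new ArrayList<>(runs);
        for (int i = 0; i < copy.size(); i++) {
            copy.remove(i);
        }
        return copy;
    }

    // Backward loop: indexes below i are unaffected by the removal.
    static List<String> removeBackward(List<String> runs) {
        List<String> copy = new ArrayList<>(runs);
        for (int i = copy.size() - 1; i >= 0; i--) {
            copy.remove(i);
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("r0", "r1", "r2", "r3");
        System.out.println(removeForward(runs));  // prints [r1, r3]
        System.out.println(removeBackward(runs)); // prints []
    }
}
```

The same reasoning applies to any remove-by-index call, which is why counting down from size() - 1 is the safer habit.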

// Remove all existing runs
removeRun(para, 0);
public static void removeRun(XWPFParagraph para, int depth)
{
if(depth > 10)
{
return;
}
int numberOfRuns = para.getRuns().size();
// Remove all existing runs
for(int i = 0; i < numberOfRuns; i++)
{
try
{
para.removeRun(numberOfRuns - i - 1);
}
catch(Exception e)
{
//e.printStackTrace();
}
}
if(para.getRuns().size() > 0)
{
removeRun(para, ++depth);
}
}

I like Apache POI, and for the most part it's great, but I have found the documentation a little patchy, to say the least.
Finding the way to delete a paragraph was quite a nightmare; my attempts to remove a paragraph kept failing with the following exception:
java.util.ConcurrentModificationException
As mentioned in Ugo Delle Donne's example, I solved this by first recording the paragraphs that I wanted to delete and then calling the document's removeBodyElement method.
e.g.
List<XWPFParagraph> record = new ArrayList<XWPFParagraph>();
for (XWPFParagraph p : doc.getParagraphs()){
// Collect the text of this paragraph only
StringBuilder text = new StringBuilder();
for (XWPFRun r : p.getRuns()){
// I saw so many examples as r.getText(pos), don't use that
text.append(r.text());
}
// Find some unique text in the paragraph
if (text.toString().contains("SOME-UNIQUE-TEXT")) {
// Save the Paragraph to delete for later
record.add( p );
}
}
// Now delete the paragraph and anything within it.
for(int i=0; i< record.size(); i++)
{
// Remove the Paragraph and everything within it
doc.removeBodyElement(doc.getPosOfParagraph( record.get(i) ));
}
// Shaaazam, I hope this helps !
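To show why the record-then-delete approach avoids the exception, here is a small JDK-only sketch. A plain ArrayList of strings stands in for the document's paragraphs; the class and method names are hypothetical and none of this is POI API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical stand-in: strings play the role of paragraphs.
final class TwoPhaseRemoval {

    // Removing from the list inside a for-each loop over the same list.
    // Returns false if the iterator detected the modification.
    static boolean removeWhileIterating(List<String> items, String marker) {
        try {
            for (String item : items) {
                if (item.contains(marker)) {
                    items.remove(item);
                }
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    // Phase 1 records the matches, phase 2 deletes them after iteration ends.
    static List<String> removeInTwoPhases(List<String> items, String marker) {
        List<String> record = new ArrayList<>();
        for (String item : items) {
            if (item.contains(marker)) {
                record.add(item);
            }
        }
        for (String item : record) {
            items.remove(item);
        }
        return items;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
                "keep-1", "SOME-UNIQUE-TEXT a", "keep-2", "SOME-UNIQUE-TEXT b");
        System.out.println(removeWhileIterating(new ArrayList<>(source), "SOME-UNIQUE-TEXT")); // prints false
        System.out.println(removeInTwoPhases(new ArrayList<>(source), "SOME-UNIQUE-TEXT"));    // prints [keep-1, keep-2]
    }
}
```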

I believe your question was answered in this question.
When you are inside of a table you need to use the functions of the XWPFTableCell instead of the XWPFDocument:
cell.removeParagraph(cell.getParagraphs().indexOf(para));

Related

How to avoid splitting of word with case

Below is the text that I want to split:
string content = "Tonight the warning from the Police commissioner as they crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT)";
string[] contentList = content.Split(' ');
I do not want to split the part (TAKE SOT). If there is text within parentheses and in upper case, how can I avoid splitting that part?
Thanks

The Split method can take two parameters. The first one is the delimiter that separates the substrings. The second one is the maximum number of elements to return in the array.
The following code snippet works for me; you can refer to it.
string content = "Tonight the warning from the Police commissioner as they crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT)";
string[] contents = content.Split(" ", content.Substring(0, content.IndexOf("(")).Split(" ").Length);
In the result, every word is a separate element and the last element is (TAKE SOT).
I use the method Split(String, Int32, StringSplitOptions).

Later after doing some research I found a solution for the problem:
public class Program
{
public static void Main()
{
string BEFORE_AND_AFTER_SOUND_EFFECT = @"\(([A-Z\s]*)\)";
List<string> wordList = new List<string>();
string content = "Tonight the warning from the Police commissioner as they (VO) crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT) INCUE:police will continue to be out there.";
GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
foreach(var item in wordList)
{
Console.WriteLine(item);
}
}
private static void GetWords(string content, List<string> wordList, string BEFORE_AND_AFTER_SOUND_EFFECT)
{
if(content.Length>0)
{
if (Regex.IsMatch(content, BEFORE_AND_AFTER_SOUND_EFFECT))
{
int textLength = content.Substring(0, Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Index).Length;
int matchLength = Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Length;
string[] words = content.Substring(0, Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Index - 1).Split(' ');
foreach (var word in words)
{
wordList.Add(word);
}
wordList.Add(content.Substring(Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Index, Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Length));
content = content.Substring((textLength + matchLength) + 1); // Remaining content
GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
}
else
{
wordList.Add(content);
content = content.Substring(content.Length); // Remaining content
GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
}
}
}
}
Thanks.

Is it possible to batch process range protections in Google Apps Script?

I have to create a dozen protected ranges in a sheet. I have code that works but is very slow because it contacts the server for each range. I know it's possible to work on a local copy of the data if there's some cell processing involved. Is it possible for range protections also?
If it's not, would caching help?
The below code uses the username from the first row as an editor for a bunch of rows in the same column.
var spreadSheet = SpreadsheetApp.getActiveSpreadsheet();
var sheets = spreadSheet.getSheets();
//Set protections per column, we start from the 4th.
for (var i = 4; i <= sheets[3].getLastColumn(); i++){
///Get the username.
var editor = sheets[3].getRange(1, i).getDisplayValue();
//Set the protection.
var protection = sheets[3].getRange(3, i, 22, 1).protect();
protection.setDescription(editor);
//Handle the case of deleted/unknown usernames.
try{
protection.addEditor(editor + '@domain.com');
} catch(error){
protection.addEditor('user@domain.com');
}
}
I've found a solution for a similar issue https://stackoverflow.com/a/37820854 but when I try to apply it to my case I get an error "TypeError: Cannot find function getRange in object Range" so I must be doing something wrong.
var test = [];
for (var i = 4; i <= sheets[3].getLastColumn(); i++){
test.push(sheets[3].getRange(3, i, 22, 1));
}
var editor;
for (var i = 0; i<test.length; i++){
var editor = test[i].getRange(1, 1).getDisplayValue();
}

The syntax for the method getRange() is getRange(row, column, numRows, numColumns), while your counter variable i loops through the columns instead of the rows.
If your intention is to loop through all columns and add an editor to each one, it should be something like
for (var i = 4; i <= sheets[3].getLastColumn(); i++){
///Get the username.
var editor = sheets[3].getRange(1, i).getDisplayValue();
//Set the protection.
var protection = sheets[3].getRange(startRow, i, rowNumber, columnNumber).protect();
protection.setDescription(editor);
//Handle the case of deleted/unknown usernames.
try{
protection.addEditor(editor + '@domain.com');
} catch(error){
protection.addEditor('user@domain.com');
}
}

It's possible to do batch processing.
But you'll have to use Advanced Google Services. Check out the Sheets Advanced service and the Sheets API documentation.

Apache PDFBox replace text results in few character missed

Using Apache PDFBox version 2.0.2 for text replacement (with the code below) produces output in which a few characters are not displayed, mostly capital letters. For example, when the replacement is "ABCDEFGHIJKLMNOPQRSTUVWXYZ", the output appears in the PDF as "ABCDEF HIJKLM OP RST W Y ". Is this a bug, or is there a workaround to handle these characters?
public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
return document;
}
PDPageTree pages = document.getDocumentCatalog().getPages();
for (PDPage page : pages) {
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
//Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj")) {
// Tj takes one operand, the string to display, so let's update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
string = StringUtils.replaceOnce(string, searchString, replacement);
cosString.setValue(string.getBytes());
}
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
out.close();
}
return document;
}

Quoting from
https://pdfbox.apache.org/2.0/migration.html
Why was the ReplaceText example removed?
The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.
You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.
======================================================================
Your description suggests that the original file uses a font subset that is missing some of the characters in your replacement text, such as G, N and Q.
And no, there is no easy workaround. You would have to delete the text you don't want from the content stream, and then append a new content stream with the text you want with a new font at the correct place.
P.S. the current PDFBox version is 2.0.7, not 2.0.2.
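To make the word-splitting problem from the quoted migration note concrete, here is a small JDK-only sketch that pulls the string fragments out of the TJ array shown above. It is not PDFBox API: the class, the regex and the method names are assumptions for illustration, and the pattern ignores escapes, nested parentheses and hex strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting the text of a TJ array such as
// [ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
final class TjFragments {

    // Matches simple literal strings "( ... )" without escapes or nesting.
    private static final Pattern LITERAL = Pattern.compile("\\(([^()]*)\\)");

    // Returns the string operands in order, dropping the kerning numbers.
    static List<String> fragments(String tjArray) {
        List<String> out = new ArrayList<>();
        Matcher m = LITERAL.matcher(tjArray);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // What a per-string search (as in the question's code) effectively does.
    static boolean anyFragmentContains(String tjArray, String word) {
        for (String fragment : fragments(tjArray)) {
            if (fragment.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Concatenation of all fragments, ignoring kerning adjustments.
    static String joined(String tjArray) {
        return String.join("", fragments(tjArray));
    }

    public static void main(String[] args) {
        String tj = "[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ";
        System.out.println(fragments(tj));                            // prints [Do, c, umen, tation]
        System.out.println(anyFragmentContains(tj, "Documentation")); // prints false
        System.out.println(joined(tj));                               // prints Documentation
    }
}
```

Joining the fragments only helps when the string bytes map directly to readable characters. With a font subset the codes are not readable text at all, so even the joined string would not contain the search word.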

JarowinklerDistance in lucene is returning strange results

I have a file containing some phrases. Using Lucene's JaroWinklerDistance, I expect to get the phrases from that file that are most similar to my input.
Here is an example of my problem.
We have a file containing:
//phrases.txt
this is goodd
this is good
this is god
If my input is "this is good", I expect to get "this is good" from the file first, since its similarity score is the highest (1). But for some reason, it returns only "this is goodd" and "this is god"!
Here is my code:
try {
SpellChecker spellChecker = new SpellChecker(new RAMDirectory(), new JaroWinklerDistance());
Dictionary dictionary = new PlainTextDictionary(new File("src/main/resources/words.txt").toPath());
IndexWriterConfig iwc=new IndexWriterConfig(new ShingleAnalyzerWrapper());
spellChecker.indexDictionary(dictionary,iwc,false);
String wordForSuggestions = "this is good";
int suggestionsNumber = 5;
String[] suggestions = spellChecker.suggestSimilar(wordForSuggestions, suggestionsNumber,0.8f);
if (suggestions!=null && suggestions.length>0) {
for (String word : suggestions) {
System.out.println("Did you mean:" + word);
}
}
else {
System.out.println("No suggestions found for word:"+wordForSuggestions);
}
} catch (IOException e) {
e.printStackTrace();
}

suggestSimilar won't provide suggestions which are identical to the input. To quote the source code:
// don't suggest a word for itself, that would be silly
If you want to know whether wordForSuggestions is in the dictionary, use the exist method:
if (spellChecker.exist(wordForSuggestions)) {
//do what you want for an, apparently, correctly spelled word
}
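The behaviour can be reproduced without Lucene. The sketch below is a textbook Jaro-Winkler implementation plus a hypothetical suggest method that skips exact matches the way the quoted comment describes. It is not Lucene's code: the real SpellChecker searches an n-gram index, and all names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Lucene's SpellChecker.
final class SuggestSketch {

    // Textbook Jaro similarity in [0, 1].
    static double jaro(String a, String b) {
        if (a.equals(b)) return 1.0;
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] usedA = new boolean[a.length()];
        boolean[] usedB = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(b.length(), i + window + 1);
            for (int j = from; j < to; j++) {
                if (!usedB[j] && a.charAt(i) == b.charAt(j)) {
                    usedA[i] = true;
                    usedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!usedA[i]) continue;
            while (!usedB[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Winkler bonus for a common prefix of up to four characters.
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int max = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < max && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    // Candidates scoring at least minScore, never the input word itself.
    static List<String> suggest(String word, List<String> dictionary, double minScore) {
        List<String> out = new ArrayList<>();
        for (String candidate : dictionary) {
            if (candidate.equals(word)) continue; // don't suggest a word for itself
            if (jaroWinkler(word, candidate) >= minScore) out.add(candidate);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("this is goodd", "this is good", "this is god");
        String input = "this is good";
        System.out.println(phrases.contains(input));        // prints true
        System.out.println(suggest(input, phrases, 0.8));   // prints [this is goodd, this is god]
    }
}
```

An exact match is therefore something to test for separately, which is the role exist plays in the answer above.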

link coming twice while exporting to pdf using itextsharp

My ASP.NET BoundField:
<asp:BoundField DataField = "SiteUrl" HtmlEncode="false" HeaderText = "Team Site URL" SortExpression = "SiteUrl" ></asp:BoundField>
My iTextSharp code:
for (int i = 0; i < dtUIExport.Rows.Count; i++)
{
for (int j = 0; j < dtUIExport.Columns.Count; j++)
{
if (j == 1)
{ continue; }
string cellText = Server.HtmlDecode(dtUIExport.Rows[i][j].ToString());
// cellText = Server.HtmlDecode((domainGridview.Rows[i][j].FindControl("link") as HyperLink).NavigateUrl);
// string cellText = Server.HtmlDecode((domainGridview.Rows[i].Cells[j].FindControl("hyperLinkId") as HyperLink).NavigateUrl);
iTextSharp.text.Font font = new iTextSharp.text.Font(bf, 10, iTextSharp.text.Font.NORMAL);
font.Color = new BaseColor(domainGridview.RowStyle.ForeColor);
iTextSharp.text.pdf.PdfPCell cell = new iTextSharp.text.pdf.PdfPCell(new Phrase(12, cellText, font));
pdfTable.AddCell(cell);
}
}
domainGridview is the grid name; however, I am building the PDF from a DataTable.
The hyperlink comes out this way:
http://dtsp2010vm:47707/sites/TS1>http://dtsp2010vm:47707/sites/TS1
How can I strip the additional link?
Edit: I have added a screenshot of the PDF file.

Your initial question didn't get an answer because it is rather misleading. You claim the link is coming twice, but that's not true. What is shown is the link in HTML syntax:
<a href="http://stackoverflow.com">http://stackoverflow.com</a>
This is the HTML definition of a single link that is stored in the cellText variable.
You are adding this content to a PdfPCell as if it were a simple string. It shouldn't surprise you that iText renders this string as-is. It would be a serious bug if iText didn't show:
<a href="http://stackoverflow.com">http://stackoverflow.com</a>
If you want the HTML to be rendered, for instance like this: http://stackoverflow.com, you need to parse the HTML into iText objects (e.g. the <a>-tag will result in a Chunk object with an anchor).
Parsing HTML for use in a PdfPCell is explained in the following question: How to add a rich Textbox (HTML) to a table cell?
When you have <a href="http://stackoverflow.com">http://stackoverflow.com</a>, you are talking about HTML, not just ordinary text. There's a big difference.

I wrote this code to achieve my result. Thanks, Bruno, for your answer.
for (int j = 0; j < dtUIExport.Columns.Count; j++)
{
if (j == 1)
{ continue; }
if (j == 2)
{
String cellTextLink = Server.HtmlDecode(dtUIExport.Rows[i][j].ToString());
cellTextLink = Regex.Replace(cellTextLink, @"<[^>]*>", String.Empty);
iTextSharp.text.Font fontLink = new iTextSharp.text.Font(bf, 10, iTextSharp.text.Font.NORMAL);
fontLink.Color = new BaseColor(domainGridview.RowStyle.ForeColor);
iTextSharp.text.pdf.PdfPCell cellLink = new iTextSharp.text.pdf.PdfPCell(new Phrase(12, cellTextLink, fontLink));
pdfTable.AddCell(cellLink);
}
}
