Trying to use Apache PDFBox version 2.0.2 for a text replace (with the code below) produces output in which a few of the characters are not displayed, mostly capital-case characters. For example, replacing text with "ABCDEFGHIJKLMNOPQRSTUVWXYZ" appears in the PDF as "ABCDEF HIJKLM OP RST W Y ". Is this a bug, or is there a workaround to handle these characters?
public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }
    PDPageTree pages = document.getDocumentCatalog().getPages();
    for (PDPage page : pages) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand: the string to display, so let's update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            string = StringUtils.replaceOnce(string, searchString, replacement);
                            cosString.setValue(string.getBytes());
                        }
                    }
                }
            }
        }
        // now that the tokens are updated we will replace the page content stream
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
        out.close();
    }
    return document;
}
Quoting from
https://pdfbox.apache.org/2.0/migration.html
Why was the ReplaceText example removed?
The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.
You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.
======================================================================
Your description suggests that the initial file uses a font subset that is missing the characters G, N, Q, U, V, X and Z (the letters that vanish from your sample output).
And no, there is no easy workaround. You would have to delete the text you don't want from the content stream, and then append a new content stream with the text you want, with a new font, at the correct place.
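To illustrate only the "append" half of that, here is a minimal sketch, assuming document and page are the PDDocument and PDPage from your method, that the old Tj/TJ tokens were already removed, and that the position (100, 700) and the choice of Helvetica are assumptions rather than values from your file:
try (PDPageContentStream cs = new PDPageContentStream(document, page,
        PDPageContentStream.AppendMode.APPEND, true, true)) {
    cs.beginText();
    // a full standard font rather than a subset, so every glyph A-Z is available
    cs.setFont(PDType1Font.HELVETICA, 12);
    cs.newLineAtOffset(100, 700); // assumed position of the deleted text
    cs.showText("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
    cs.endText();
}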
P.S. the current PDFBox version is 2.0.7, not 2.0.2.
Below is the text that I want to split:
string content = "Tonight the warning from the Police commissioner as they crack down on anyone who leaves there house without a 'reasonable excuse'
(TAKE SOT)"
string[] contentList = content.split(' ');
I do not want to split the word i.e (TAKE SOT), if there is a text within parentheses and in an upper case then how to avoid splitting of the part.
Thanks
The Split method can take two parameters. The first one delimits the substrings; the second one is the maximum number of elements expected in the array.
The following code snippet works for me; you can refer to it:
string content = "Tonight the warning from the Police commissioner as they crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT)";
string[] contents = content.Split(" ", content.Substring(0, content.IndexOf("(")).Split(" ").Length);
I used the method Split(String, Int32, StringSplitOptions): because the element count is capped at the number of tokens before the "(", everything from "(TAKE SOT)" onward stays together, unsplit, in the final element.
Later, after doing some research, I found a solution to the problem:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string BEFORE_AND_AFTER_SOUND_EFFECT = @"\(([A-Z\s]*)\)";
        List<string> wordList = new List<string>();
        string content = "Tonight the warning from the Police commissioner as they (VO) crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT) INCUE:police will continue to be out there.";
        GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
        foreach (var item in wordList)
        {
            Console.WriteLine(item);
        }
    }

    private static void GetWords(string content, List<string> wordList, string BEFORE_AND_AFTER_SOUND_EFFECT)
    {
        if (content.Length > 0)
        {
            if (Regex.IsMatch(content, BEFORE_AND_AFTER_SOUND_EFFECT))
            {
                int textLength = content.Substring(0, Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Index).Length;
                int matchLength = Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Length;
                string[] words = content.Substring(0, Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Index - 1).Split(' ');
                foreach (var word in words)
                {
                    wordList.Add(word);
                }
                wordList.Add(content.Substring(Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Index, Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT).Length));
                content = content.Substring((textLength + matchLength) + 1); // remaining content
                GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
            }
            else
            {
                wordList.Add(content);
                content = content.Substring(content.Length); // remaining content
                GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
            }
        }
    }
}
Thanks.
My ASP.NET BoundField:
<asp:BoundField DataField = "SiteUrl" HtmlEncode="false" HeaderText = "Team Site URL" SortExpression = "SiteUrl" ></asp:BoundField>
My iTextSharp code:
for (int i = 0; i < dtUIExport.Rows.Count; i++)
{
    for (int j = 0; j < dtUIExport.Columns.Count; j++)
    {
        if (j == 1)
        { continue; }
        string cellText = Server.HtmlDecode(dtUIExport.Rows[i][j].ToString());
        // cellText = Server.HtmlDecode((domainGridview.Rows[i][j].FindControl("link") as HyperLink).NavigateUrl);
        // string cellText = Server.HtmlDecode((domainGridview.Rows[i].Cells[j].FindControl("hyperLinkId") as HyperLink).NavigateUrl);
        iTextSharp.text.Font font = new iTextSharp.text.Font(bf, 10, iTextSharp.text.Font.NORMAL);
        font.Color = new BaseColor(domainGridview.RowStyle.ForeColor);
        iTextSharp.text.pdf.PdfPCell cell = new iTextSharp.text.pdf.PdfPCell(new Phrase(12, cellText, font));
        pdfTable.AddCell(cell);
    }
}
domainGridview is the grid's name; however, I am building the PDF from a DataTable.
The hyperlink comes through like this:
<a href='http://dtsp2010vm:47707/sites/TS1'>http://dtsp2010vm:47707/sites/TS1</a>
How do I strip the additional link markup?
Edit: I have added a screenshot of the PDF file.
Your initial question didn't get an answer because it is rather misleading. You claim the link is coming in twice, but that's not true. From iText's point of view, the cell content is HTML syntax:
<a href="http://stackoverflow.com">http://stackoverflow.com</a>
This is the HTML definition of a single link, and it is stored in the cellText parameter.
You are adding this content to a PdfPCell as if it were a simple string. It shouldn't surprise you that iText renders this string as-is. It would be a serious bug if iText didn't show:
<a href="http://stackoverflow.com">http://stackoverflow.com</a>
If you want the HTML to be rendered, for instance as the clickable link http://stackoverflow.com, you need to parse the HTML into iText objects (e.g. the <a> tag will result in a Chunk object with an anchor).
Parsing HTML for use in a PdfPCell is explained in the following question: How to add a rich Textbox (HTML) to a table cell?
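As a minimal sketch in Java iText 5 (iTextSharp mirrors this API with C# naming), the since-deprecated HTMLWorker can turn such a snippet into iText elements; the html value below is just the example link above:
// Element, PdfPCell and HTMLWorker come from com.itextpdf.text,
// com.itextpdf.text.pdf and com.itextpdf.text.html.simpleparser respectively
String html = "<a href=\"http://stackoverflow.com\">http://stackoverflow.com</a>";
PdfPCell cell = new PdfPCell();
// parse the HTML into iText elements; the <a> tag becomes a Chunk with an anchor
for (Element e : HTMLWorker.parseToList(new StringReader(html), null)) {
    cell.addElement(e);
}
pdfTable.addCell(cell);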
When you have <a href="http://stackoverflow.com">http://stackoverflow.com</a>, you are talking about HTML, not just ordinary text. There's a big difference.
I wrote this code to achieve my result. Thanks, Bruno, for your answer.
for (int j = 0; j < dtUIExport.Columns.Count; j++)
{
    if (j == 1)
    { continue; }
    if (j == 2)
    {
        String cellTextLink = Server.HtmlDecode(dtUIExport.Rows[i][j].ToString());
        cellTextLink = Regex.Replace(cellTextLink, @"<[^>]*>", String.Empty);
        iTextSharp.text.Font fontLink = new iTextSharp.text.Font(bf, 10, iTextSharp.text.Font.NORMAL);
        fontLink.Color = new BaseColor(domainGridview.RowStyle.ForeColor);
        iTextSharp.text.pdf.PdfPCell cellLink = new iTextSharp.text.pdf.PdfPCell(new Phrase(12, cellTextLink, fontLink));
        pdfTable.AddCell(cellLink);
    }
}
I am new to the Lucene world and don't have much working knowledge of the subject. I need to extract the document term vector, and I found the following code online: How to extract Document Term Vector in Lucene 3.5.0.
/**
 * Sums the term frequency vector of each document into a single term frequency map
 * @param indexReader the index reader, the document numbers are specific to this reader
 * @param docNumbers document numbers to retrieve frequency vectors from
 * @param fieldNames field names to retrieve frequency vectors from
 * @param stopWords terms to ignore
 * @return a map of each term to its frequency
 * @throws IOException
 */
private Map<String,Integer> getTermFrequencyMap(IndexReader indexReader, List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
        throws IOException {
    Map<String,Integer> totalTfv = new HashMap<String,Integer>(1024);
    for (Integer docNum : docNumbers) {
        for (String fieldName : fieldNames) {
            TermFreqVector tfv = indexReader.getTermFreqVector(docNum, fieldName);
            if (tfv == null) {
                // ignore empty fields
                continue;
            }
            String[] terms = tfv.getTerms();
            int termCount = terms.length;
            int[] freqs = tfv.getTermFrequencies();
            for (int t = 0; t < termCount; t++) {
                String term = terms[t];
                int freq = freqs[t];
                // filter out single-letter words and stop words
                if (StringUtils.length(term) < 2 || stopWords.contains(term)) {
                    continue; // stop
                }
                Integer totalFreq = totalTfv.get(term);
                totalFreq = (totalFreq == null) ? freq : freq + totalFreq;
                totalTfv.put(term, totalFreq);
            }
        }
    }
    return totalTfv;
}
I have created the index which resides in the following directory.
String indexDir = "C:\\Lucene\\Output\\";
Directory dir = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(dir);
My problem is that I do not know how to get the doc IDs (the List<Integer> docNumbers) required by the above function. I have tried a couple of methods, like
TermDocs docs = reader.termDocs();
but it did not work.
Lucene starts assigning ids from zero, and maxDoc() is the upper limit, so you can simply loop to get all ids, skipping deleted documents (Lucene marks them for deletion when you call deleteDocument):
for (int docNum = 0; docNum < reader.maxDoc(); docNum++) {
    if (reader.isDeleted(docNum)) {
        continue;
    }
    TermFreqVector tfv = reader.getTermFreqVector(docNum, "fieldName");
    ...
}
For this to work, you have to enable them during indexing, see Field.TermVector.
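For reference, here is a minimal Lucene 3.x indexing sketch with term vectors turned on; the field name, sample text, and choice of StandardAnalyzer are assumptions for illustration:
Directory dir = FSDirectory.open(new File("C:\\Lucene\\Output\\"));
IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

Document doc = new Document();
// Field.TermVector.YES stores this field's term frequency vector per document,
// which is what getTermFreqVector() reads back later
doc.add(new Field("fieldName", "some text to index",
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
writer.addDocument(doc);
writer.close();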
I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary, and output the corresponding "translation" of each token. However, I also want to preserve all the original whitespace, stopwords, etc. from the input, so that the output is formatted the same way as the input instead of ending up as a bare stream of translations. So if my input is
Term1: Term2 Stopword! Term3
Term4
then I want the output to look like
Term1': Term2' Stopword! Term3'
Term4'
(where Termi' is the translation of Termi) instead of simply
Term1' Term2' Term3' Term4'
Currently I am doing the following:
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
        PatternAnalyzer.WHITESPACE_PATTERN,
        false,
        WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
ts.reset(); // the TokenStream contract requires reset() before incrementToken()
while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}
But this, of course, loses all the whitespace, etc. How can I modify it so that I can re-insert the whitespace into the output? Thanks much!
============ UPDATE!
I tried splitting the original stream into "words" and "non-words". It seems to work fine. Not sure whether it's the most efficient way, though:
public ArrayList<Token> splitToWords(String sIn)
{
    if (sIn == null || sIn.length() == 0) {
        return null;
    }
    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue;
        }
        TokenType type = TokenType.NONWORD;
        if (curIsLetter) {
            type = TokenType.WORD;
        }
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    TokenType type = TokenType.NONWORD;
    if (curIsLetter) {
        type = TokenType.WORD;
    }
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));
    return list;
}
Well, it doesn't really lose whitespace; you still have your original text. :)
So I think you should make use of OffsetAttribute, which contains the startOffset() and endOffset() of each term into your original text. This is what Lucene uses, for example, to highlight snippets of search results in the original text.
I wrote up a quick test (uses EnglishAnalyzer) to demonstrate:
The input is:
Just a test of some ideas. Let's see if it works.
The output is:
just a test of some idea. let see if it work.
// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
    String input = "Just a test of some ideas. Let's see if it works.";
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
    StringBuilder output = new StringBuilder(input);
    // in some cases, the analyzer will make terms longer or shorter.
    // because of this we must track how much we have adjusted the text so far
    // so that the offsets returned will still work for us via replace()
    int delta = 0;
    TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String term = termAtt.toString();
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        output.replace(delta + start, delta + end, term);
        delta += (term.length() - (end - start));
    }
    ts.close();
    System.out.println(output.toString());
}
I want to save the content of a RichTextBox to varbinary (= byte array) in XamlPackage format.
I need technical advice on how to do it.
I actually need to know how to convert between a FlowDocument and a byte array.
Is it even recommended to store it as varbinary, or is this a bad idea?
Update
Code snippet:
/// Load
byte[] document = GetDocumentFromDataBase();
RichTextBox tb = new RichTextBox();
TextRange tr = new TextRange(tb.Document.ContentStart, tb.Document.ContentEnd);
tr.Load(--------------------------); // Load from the byte array.

/// Save
int maxAllowed = 1024;
byte[] document;
RichTextBox tb = new RichTextBox();
// User entered text and designs in the rich text
TextRange tr = new TextRange(tb.Document.ContentStart, tb.Document.ContentEnd);
tr.Save(--------------------------); // Save to byte array
if (document.Length > maxAllowed)
{
    MessageBox.Show((document.Length - maxAllowed) + " Exceeding limit.");
    return;
}
SaveToDataBase();
I can't find my full example right now, but you can use XamlReader and XamlWriter to get the document into and out of a string. From there, you can use UnicodeEncoding, ASCIIEncoding, or whatever encoder you want to get it into and out of bytes.
Here is my shorter example of setting the document from a string (docReader is my FlowDocumentReader):
private void SetDetails(string detailsString)
{
    if (docReader == null)
        return;

    if (String.IsNullOrEmpty(detailsString))
    {
        this.docReader.Document = null;
        return;
    }

    using (StringReader stringReader = new StringReader(detailsString))
    {
        using (System.Xml.XmlReader reader = System.Xml.XmlReader.Create(stringReader))
        {
            this.docReader.Document = XamlReader.Load(reader) as FlowDocument;
        }
    }
}