lucene.net performance issue with custom LetterTokenizer - lucene

I'm using Lucene.NET 3.0.3 and I have a simple customized analyzer and tokenizer which break terms on TAB characters. I measured it, and indexing turns out to be twice as slow as with a StandardAnalyzer (which does far more work). Do you know what the problem might be, or whether there is a better solution?
The code is below:
public class CustomAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new CustomTokenizer(reader);
        //return new LetterTokenizer(reader);
    }

    public override TokenStream ReusableTokenStream(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = this.PreviousTokenStream as Tokenizer;
        if (tokenizer == null)
        {
            tokenizer = new CustomTokenizer(reader);
            //tokenizer = new LetterTokenizer(reader);
        }
        else
        {
            tokenizer.Reset(reader);
        }
        return tokenizer;
    }
}
public class CustomTokenizer : LetterTokenizer
{
    public CustomTokenizer(TextReader reader)
        : base(reader)
    { }

    protected override char Normalize(char c)
    {
        return char.ToLower(c, CultureInfo.InvariantCulture);
    }

    protected override bool IsTokenChar(char c)
    {
        // TAB is the same code point (0x0009) in ASCII and Unicode
        return c != '\x0009';
    }
}

I forgot to update this thread: the issue was in the custom analyzer. I did not actually save the tokenizer to PreviousTokenStream, so a new tokenizer was being created on every call.
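For reference, a minimal sketch of the corrected method (assuming Lucene.NET 3.0.3, where PreviousTokenStream is a protected property on Analyzer):
public override TokenStream ReusableTokenStream(string fieldName, TextReader reader)
{
    var tokenizer = this.PreviousTokenStream as Tokenizer;
    if (tokenizer == null)
    {
        tokenizer = new CustomTokenizer(reader);
        // This line was missing in the original: without it, PreviousTokenStream
        // stays null and a new tokenizer is allocated on every call.
        this.PreviousTokenStream = tokenizer;
    }
    else
    {
        tokenizer.Reset(reader);
    }
    return tokenizer;
}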

Related

How to replace / remove text from a PDF file

How would I go about replacing / removing text from a PDF file?
I have a PDF file that I obtained somewhere, and I want to be able to replace some text within it.
Or, I have a PDF file in which I want to obscure (redact) some of the text so that it's no longer visible [and so that it looks cool, like the CIA files].
Or, I have a PDF that contains global Javascript that I want to stop from interrupting my use of the PDF.
This is possible in a limited fashion with the use of iText / iTextSharp.
It will only work with Tj/TJ opcodes (i.e. standard text, not text embedded in images or drawn with shapes).
You need to override the default PdfContentStreamProcessor to act on the page content streams, as presented by Mkl in Removing Watermark from PDF iTextSharp. Inherit from this class, and in your new class look for the Tj/TJ opcodes; the operand(s) will generally be the text element(s) (for a TJ this may not be straightforward text, and may require further parsing of all the operands).
A pretty basic example of some of the flexibility around iTextSharp is available from this github repository https://github.com/bevanweiss/PdfEditor (code excerpts below also)
NOTE: This utilises the AGPL version of iTextSharp (and is hence also AGPL), so if you will be distributing executables derived from this code or allowing others to interact with those executables in any way then you must also provide your modified source code. There is also no warranty, implied or expressed, related to this code. Use at your own peril.
PdfContentStreamEditor
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class PdfContentStreamEditor : PdfContentStreamProcessor
    {
        /**
         * This method edits the immediate contents of a page, i.e. its content stream.
         * It explicitly does not descend into form xobjects, patterns, or annotations.
         */
        public void EditPage(PdfStamper pdfStamper, int pageNum)
        {
            var pdfReader = pdfStamper.Reader;
            var page = pdfReader.GetPageN(pageNum);
            var pageContentInput = ContentByteUtils.GetContentBytesForPage(pdfReader, pageNum);
            page.Remove(PdfName.CONTENTS);
            EditContent(pageContentInput, page.GetAsDict(PdfName.RESOURCES), pdfStamper.GetUnderContent(pageNum));
        }

        /**
         * This method processes the content bytes and outputs to the given canvas.
         * It explicitly does not descend into form xobjects, patterns, or annotations.
         */
        public virtual void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
        {
            this.Canvas = canvas;
            ProcessContent(contentBytes, resources);
            this.Canvas = null;
        }

        /**
         * This method writes content stream operations to the target canvas. The default
         * implementation writes them as they come, so it essentially generates identical
         * copies of the original instructions that the {@link ContentOperatorWrapper} instances
         * forward to it.
         *
         * Override this method to achieve some fancy editing effect.
         */
        protected virtual void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
        {
            var index = 0;
            foreach (var pdfObject in operands)
            {
                pdfObject.ToPdf(null, Canvas.InternalBuffer);
                Canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
            }
        }

        //
        // constructor giving the parent a dummy listener to talk to
        //
        public PdfContentStreamEditor() : base(new DummyRenderListener())
        {
        }

        //
        // constructor giving the parent a caller-supplied listener to talk to
        //
        public PdfContentStreamEditor(IRenderListener renderListener) : base(renderListener)
        {
        }

        //
        // Overrides of PdfContentStreamProcessor methods
        //
        public override IContentOperator RegisterContentOperator(string operatorString, IContentOperator newOperator)
        {
            var wrapper = new ContentOperatorWrapper();
            wrapper.SetOriginalOperator(newOperator);
            var formerOperator = base.RegisterContentOperator(operatorString, wrapper);
            return (formerOperator is ContentOperatorWrapper operatorWrapper ? operatorWrapper.GetOriginalOperator() : formerOperator);
        }

        public override void ProcessContent(byte[] contentBytes, PdfDictionary resources)
        {
            this.Resources = resources;
            base.ProcessContent(contentBytes, resources);
            this.Resources = null;
        }

        //
        // members holding the output canvas and the resources
        //
        protected PdfContentByte Canvas = null;
        protected PdfDictionary Resources = null;

        //
        // A content operator class to wrap all content operators to forward the invocation to the editor
        //
        class ContentOperatorWrapper : IContentOperator
        {
            public IContentOperator GetOriginalOperator()
            {
                return _originalOperator;
            }

            public void SetOriginalOperator(IContentOperator op)
            {
                this._originalOperator = op;
            }

            public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
            {
                if (_originalOperator != null && !"Do".Equals(oper.ToString()))
                {
                    _originalOperator.Invoke(processor, oper, operands);
                }
                ((PdfContentStreamEditor)processor).Write(processor, oper, operands);
            }

            private IContentOperator _originalOperator = null;
        }

        //
        // A dummy render listener to give to the underlying content stream processor to feed events to
        //
        class DummyRenderListener : IRenderListener
        {
            public void BeginTextBlock() { }
            public void RenderText(TextRenderInfo renderInfo) { }
            public void EndTextBlock() { }
            public void RenderImage(ImageRenderInfo renderInfo) { }
        }
    }
}
TextReplaceStreamEditor
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class TextReplaceStreamEditor : PdfContentStreamEditor
    {
        public TextReplaceStreamEditor(string MatchPattern, string ReplacePattern)
        {
            _matchPattern = MatchPattern;
            _replacePattern = ReplacePattern;
        }

        private string _matchPattern;
        private string _replacePattern;

        protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
        {
            var operatorString = oper.ToString();
            if ("Tj".Equals(operatorString) || "TJ".Equals(operatorString))
            {
                for (var i = 0; i < operands.Count; i++)
                {
                    if (!operands[i].IsString())
                        continue;
                    var text = operands[i].ToString();
                    if (Regex.IsMatch(text, _matchPattern))
                    {
                        operands[i] = new PdfString(Regex.Replace(text, _matchPattern, _replacePattern));
                    }
                }
            }
            base.Write(processor, oper, operands);
        }
    }
}
TextRedactStreamEditor
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class TextRedactStreamEditor : PdfContentStreamEditor
    {
        public TextRedactStreamEditor(string MatchPattern) : base(new RedactRenderListener(MatchPattern))
        {
            _matchPattern = MatchPattern;
        }

        private string _matchPattern;

        protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
        {
            base.Write(processor, oper, operands);
        }

        public override void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
        {
            ((RedactRenderListener)base.RenderListener).SetCanvas(canvas);
            base.EditContent(contentBytes, resources, canvas);
        }
    }

    //
    // A pretty simple render listener, all we care about is text stuff.
    // We listen out for text blocks, look for our text, and then put a
    // black box over it.. text 'redacted'
    //
    class RedactRenderListener : IRenderListener
    {
        private PdfContentByte _canvas;
        private string _matchPattern;

        public RedactRenderListener(string MatchPattern)
        {
            _matchPattern = MatchPattern;
        }

        public RedactRenderListener(PdfContentByte Canvas, string MatchPattern)
        {
            _canvas = Canvas;
            _matchPattern = MatchPattern;
        }

        public void SetCanvas(PdfContentByte Canvas)
        {
            _canvas = Canvas;
        }

        public void BeginTextBlock() { }

        public void RenderText(TextRenderInfo renderInfo)
        {
            var text = renderInfo.GetText();
            var match = Regex.Match(text, _matchPattern);
            if (match.Success)
            {
                var p1 = renderInfo.GetCharacterRenderInfos()[match.Index].GetAscentLine().GetStartPoint();
                var p2 = renderInfo.GetCharacterRenderInfos()[match.Index + match.Length].GetAscentLine().GetEndPoint();
                var p3 = renderInfo.GetCharacterRenderInfos()[match.Index + match.Length].GetDescentLine().GetEndPoint();
                var p4 = renderInfo.GetCharacterRenderInfos()[match.Index].GetDescentLine().GetStartPoint();
                _canvas.SaveState();
                _canvas.SetColorStroke(BaseColor.BLACK);
                _canvas.SetColorFill(BaseColor.BLACK);
                _canvas.MoveTo(p1[Vector.I1], p1[Vector.I2]);
                _canvas.LineTo(p2[Vector.I1], p2[Vector.I2]);
                _canvas.LineTo(p3[Vector.I1], p3[Vector.I2]);
                _canvas.LineTo(p4[Vector.I1], p4[Vector.I2]);
                _canvas.ClosePathFillStroke();
                _canvas.RestoreState();
            }
        }

        public void EndTextBlock() { }

        public void RenderImage(ImageRenderInfo renderInfo) { }
    }
}
Using them with iTextSharp
var reader = new PdfReader("SRC FILE PATH GOES HERE");
var dstFile = File.Open("DST FILE PATH GOES HERE", FileMode.Create);
var pdfStamper = new PdfStamper(reader, dstFile, reader.PdfVersion, false);

// We don't need to auto-rotate, as the PdfContentStreamEditor will already deal with pre-rotated space..
// if we enable this we will inadvertently rotate the content.
pdfStamper.RotateContents = false;

// This is for the Text Replace
var replaceTextProcessor = new TextReplaceStreamEditor(
    "TEXT TO REPLACE HERE",
    "TEXT TO SUBSTITUTE IN HERE");
for (int i = 1; i <= reader.NumberOfPages; i++)
    replaceTextProcessor.EditPage(pdfStamper, i);

// This is for the Text Redact
var redactTextProcessor = new TextRedactStreamEditor(
    "TEXT TO REDACT HERE");
for (int i = 1; i <= reader.NumberOfPages; i++)
    redactTextProcessor.EditPage(pdfStamper, i);

// Since our redacting just puts a box over the top, we should secure the document a bit,
// to prevent people copying/pasting the text behind the box. We also prevent text-to-speech
// processing of the file, otherwise the 'hidden' text will be spoken.
pdfStamper.Writer.SetEncryption(null,
    Encoding.UTF8.GetBytes("ownerPassword"),
    PdfWriter.AllowDegradedPrinting | PdfWriter.AllowPrinting,
    PdfWriter.ENCRYPTION_AES_256);

// hey, lets get rid of Javascript too, because it's annoying
pdfStamper.Javascript = "";

// and then finally we close our files (saving them in the process)
pdfStamper.Close();
reader.Close();
You can use GroupDocs.Redaction (available for .NET) to replace or remove text from PDF documents. You can perform exact-phrase, case-sensitive, and regular-expression redaction (removal) of text. The following code snippet replaces the word "candy" with "[redacted]" in the loaded PDF document.
C#:
using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
    doc.RedactWith(new ExactPhraseRedaction("candy", new ReplacementOptions("[redacted]")));
    // Save the document to "*_Redacted.*" file.
    doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false });
}
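The library also advertises regular-expression redaction. A minimal sketch of that variant, assuming a RegexRedaction class that follows the same pattern as ExactPhraseRedaction (check the current GroupDocs.Redaction API docs before relying on the exact name and signature):
using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
    // Assumed API: RegexRedaction takes a pattern string plus the same ReplacementOptions.
    doc.RedactWith(new RegexRedaction("cand(y|ies)", new ReplacementOptions("[redacted]")));
    doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false });
}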
Disclosure: I work as Developer Evangelist at GroupDocs.

Lucene does not index some terms in documents

I have been trying to use Lucene to index our code database. Unfortunately, some terms get omitted from the index. E.g. in the below string, I can search on anything other than "version-number":
version-number "cAELimpts.spl SCOPE-PAY:10.1.10 25nov2013kw101730 Setup EMployee field if missing"
I have tried implementing it with both Lucene.NET 3.1 and pylucene 6.2.0, with the same result.
Here are some details of my implementation in Lucene.NET:
using (var writer = new IndexWriter(FSDirectory.Open(INDEX_DIR), new CustomAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    Console.Out.WriteLine("Indexing to directory '" + INDEX_DIR + "'...");
    IndexDirectory(writer, docDir);
    Console.Out.WriteLine("Optimizing...");
    writer.Optimize();
    writer.Commit();
}
The CustomAnalyzer class:
public sealed class CustomAnalyzer : Analyzer
{
    public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
    {
        return new LowerCaseFilter(new CustomTokenizer(reader));
    }
}
Finally, the CustomTokenizer class:
public class CustomTokenizer : CharTokenizer
{
    public CustomTokenizer(TextReader input) : base(input)
    {
    }

    public CustomTokenizer(AttributeFactory factory, TextReader input) : base(factory, input)
    {
    }

    public CustomTokenizer(AttributeSource source, TextReader input) : base(source, input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        return System.Char.IsLetterOrDigit(c) || c == '_' || c == '-';
    }
}
It looks like "version-number" and some other terms are not getting indexed because they are present in 99% of the documents. Could that be the cause of the problem?
EDIT: As requested, the FileDocument class:
public static class FileDocument
{
    public static Document Document(FileInfo f)
    {
        // make a new, empty document
        Document doc = new Document();
        doc.Add(new Field("path", f.FullName, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("modified", DateTools.TimeToString(f.LastWriteTime.Millisecond, DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("contents", new StreamReader(f.FullName, System.Text.Encoding.Default)));
        // return the document
        return doc;
    }
}
I think I was being an idiot. I was limiting the number of hits to 500 and then applying filters to those hits. The items were retrieved in the order they had been indexed, so when I searched for something near the end of the index it reported that nothing was found; in fact the expected 500 items were retrieved, but they had all been filtered out.
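For anyone hitting the same thing: rather than post-filtering a fixed page of hits, you can hand the filter to the searcher so the hit limit is applied to the already-filtered result set. A minimal sketch against Lucene.NET 3.0.3 (the field names and the QueryWrapperFilter criterion are purely illustrative, not from the original code):
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Open a read-only searcher over the same index directory used above.
var searcher = new IndexSearcher(FSDirectory.Open(INDEX_DIR), true);

// The query we actually want to run...
var query = new TermQuery(new Term("contents", "version-number"));

// ...and the restriction, expressed as a Lucene Filter (hypothetical criterion).
var filter = new QueryWrapperFilter(new TermQuery(new Term("path", "payroll")));

// The 500-hit cap now applies after filtering, so matches near the end of the
// index are no longer silently dropped.
TopDocs hits = searcher.Search(query, filter, 500);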

Lucene Custom Analyzer - TokenStream Contract Violation

I was trying to create my own custom analyzer and tokenizer classes in Lucene. I followed mostly the instructions here:
http://www.citrine.io/blog/2015/2/14/building-a-custom-analyzer-in-lucene
And I updated as needed (in Lucene's newer versions the Reader is stored in "input")
However I get an exception:
TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
What could be the reason for this? I gather that calling reset/close is not my job at all, but should be done by the analyzer.
Here's my custom analyzer class:
public class MyAnalyzer extends Analyzer {
    protected TokenStreamComponents createComponents(String FieldName) {
        // TODO Auto-generated method stub
        return new TokenStreamComponents(new MyTokenizer());
    }
}
And my custom tokenizer class:
public class MyTokenizer extends Tokenizer {

    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);

    public MyTokenizer() {
        char[] buffer = new char[1024];
        int numChars;
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars = this.input.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        String StringToTokenize = stringBuilder.toString();
        Terms = Tokenize(StringToTokenize);
    }

    public boolean incrementToken() throws IOException {
        if (CurrentTerm >= Terms.length)
            return false;
        this.charTermAttribute.setEmpty();
        this.charTermAttribute.append(Terms[CurrentTerm]);
        CurrentTerm++;
        return true;
    }

    static String[] Tokenize(String StringToTokenize) {
        // Here I process the string and create an array of terms.
        // I tested this method and it works ok.
        // In case it's relevant, I parse the string into terms in the constructor.
        // Then in incrementToken I simply iterate over the Terms array and
        // submit them one at a time.
        return Processed;
    }

    public void reset() throws IOException {
        super.reset();
        Terms = null;
        CurrentTerm = 0;
    }

    String[] Terms;
    int CurrentTerm;
}
When I traced the exception, I saw that the problem was with input.read: it seems that there is nothing inside input (or rather, there is an ILLEGAL_STATE_READER in it). I don't understand it.
You are reading from the input stream in your Tokenizer constructor, before it is reset.
The problem here, I think, is that you are handling the input as a String, instead of as a Stream. The intent is for you to efficiently read from the stream in the incrementToken method, rather than to load the whole stream into a String and pre-process a big ol' list of tokens at the beginning.
It is possible to go this route, though. Just move all the logic currently in the constructor into your reset method instead (after the super.reset() call).
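For example, a rough sketch of that second route (untested; it keeps the field and method names from the tokenizer above):
@Override
public void reset() throws IOException {
    super.reset();
    // By the time reset() runs, the consumer has already attached the Reader,
    // so it is now legal to consume this.input (unlike in the constructor).
    char[] buffer = new char[1024];
    int numChars;
    StringBuilder stringBuilder = new StringBuilder();
    while ((numChars = this.input.read(buffer, 0, buffer.length)) != -1) {
        stringBuilder.append(buffer, 0, numChars);
    }
    Terms = Tokenize(stringBuilder.toString());
    CurrentTerm = 0;
}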

Oracle Coherence index not working with ContainsFilter query

I've added an index to a cache. The index uses a custom extractor that extends AbstractExtractor and overrides only the extract method to return a List of Strings. Then I have a ContainsFilter which uses the same custom extractor and looks for the occurrence of a single String in the List of Strings. Judging by the time it takes to execute my test, it does not look like my index is being used. What am I doing wrong? Also, is there some debugging I can switch on to see which indices are used?
public class DependencyIdExtractor extends AbstractExtractor {

    /**
     *
     */
    private static final long serialVersionUID = 1L;

    @Override
    public Object extract(Object oTarget) {
        if (oTarget == null) {
            return null;
        }
        if (oTarget instanceof CacheValue) {
            CacheValue cacheValue = (CacheValue)oTarget;
            // returns a List of String objects
            return cacheValue.getDependencyIds();
        }
        throw new UnsupportedOperationException();
    }
}
Adding the index:
mCache = CacheFactory.getCache(pCacheName);
mCache.addIndex(new DependencyIdExtractor(), false, null);
Performing the ContainsFilter query:
public void invalidateByDependencyId(String pDependencyId) {
    ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);

    @SuppressWarnings("rawtypes")
    Set setKeys = mCache.keySet(vContainsFilter);
    mCache.keySet().removeAll(setKeys);
}
I solved this by adding a hashCode and equals method implementation to the DependencyIdExtractor class. It is important that you use exactly the same value extractor when adding an index and creating your filter.
public class DependencyIdExtractor extends AbstractExtractor {

    /**
     *
     */
    private static final long serialVersionUID = 1L;

    @Override
    public Object extract(Object oTarget) {
        if (oTarget == null) {
            return null;
        }
        if (oTarget instanceof CacheValue) {
            CacheValue cacheValue = (CacheValue)oTarget;
            return cacheValue.getDependencyIds();
        }
        throw new UnsupportedOperationException();
    }

    @Override
    public int hashCode() {
        return 1;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (obj instanceof DependencyIdExtractor) {
            return true;
        }
        return false;
    }
}
To debug Coherence indices/queries, you can generate an explain plan similar to database query explain plans.
http://www.oracle.com/technetwork/tutorials/tutorial-1841899.html
@SuppressWarnings("unchecked")
public void invalidateByDependencyId(String pDependencyId) {
    ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);

    if (mLog.isTraceEnabled()) {
        QueryRecorder agent = new QueryRecorder(RecordType.EXPLAIN);
        Object resultsExplain = mCache.aggregate(vContainsFilter, agent);
        mLog.trace("resultsExplain = \n" + resultsExplain + "\n");
    }

    @SuppressWarnings("rawtypes")
    Set setKeys = mCache.keySet(vContainsFilter);
    mCache.keySet().removeAll(setKeys);
}

filters effect on search results in solr

When I query for "elegant" in Solr, I get results for "elegance" too.
I used these filters for the index analyzer:
WhitespaceTokenizerFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
SynonymFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
ReversedWildcardFilterFactory
and these for the query analyzer:
WhitespaceTokenizerFactory
SynonymFilterFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
I want to know which filter is affecting my search results.
EnglishPorterFilterFactory
That's the short answer ;)
A little more information:
EnglishPorter refers to the English Porter stemming algorithm. According to the stemmer (which is a heuristic word-root builder), both "elegant" and "elegance" have the same stem.
You can verify this online, e.g. here. Basically you will see "elegant" and "elegance" both reduced to the same stem: "eleg".
From Solr source:
public void inform(ResourceLoader loader) {
    String wordFiles = args.get(PROTECTED_TOKENS);
    if (wordFiles != null) {
        try {
Here is exactly where the protwords file comes into play:
            File protectedWordFiles = new File(wordFiles);
            if (protectedWordFiles.exists()) {
                List<String> wlist = loader.getLines(wordFiles);
                // This cast is safe in Lucene
                protectedWords = new CharArraySet(wlist, false); // No need to go through StopFilter as before, since it just uses a List internally
            } else {
                List<String> files = StrUtils.splitFileNames(wordFiles);
                for (String file : files) {
                    List<String> wlist = loader.getLines(file.trim());
                    if (protectedWords == null)
                        protectedWords = new CharArraySet(wlist, false);
                    else
                        protectedWords.addAll(wlist);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
That's the part which affects the stemming. Below you can see the invocation of the Snowball library:
    public EnglishPorterFilter create(TokenStream input) {
        return new EnglishPorterFilter(input, protectedWords);
    }
}

/**
 * English Porter2 filter that doesn't use reflection to
 * adapt lucene to the snowball stemmer code.
 */
@Deprecated
class EnglishPorterFilter extends SnowballPorterFilter {
    public EnglishPorterFilter(TokenStream source, CharArraySet protWords) {
        super(source, new org.tartarus.snowball.ext.EnglishStemmer(), protWords);
    }
}
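If you need "elegant" and "elegance" to stay distinct, the protwords file referenced above is the usual lever: tokens listed there are protected from stemming. A minimal sketch, assuming the conventional protwords.txt wiring in schema.xml (adjust the file name and field type to your setup, and apply it to both the index and query analyzers):
<!-- schema.xml: point the stemmer filter at a protected-words file -->
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

# protwords.txt: one surface form per line; these tokens bypass the stemmer entirely
elegant
elegance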