JaroWinklerDistance in Lucene is returning strange results - lucene

I have a file containing some phrases. Using Lucene's JaroWinklerDistance, I expect it to return the phrases from that file that are most similar to my input.
Here is an example of my problem.
We have a file containing:
//phrases.txt
this is goodd
this is good
this is god
If my input is 'this is good', it should return 'this is good' from the file first, since that similarity score is the highest (1). But for some reason, it returns only "this is goodd" and "this is god"!
Here is my code:
try {
    SpellChecker spellChecker = new SpellChecker(new RAMDirectory(), new JaroWinklerDistance());
    Dictionary dictionary = new PlainTextDictionary(new File("src/main/resources/phrases.txt").toPath());
    IndexWriterConfig iwc = new IndexWriterConfig(new ShingleAnalyzerWrapper());
    spellChecker.indexDictionary(dictionary, iwc, false);

    String wordForSuggestions = "this is good";
    int suggestionsNumber = 5;
    String[] suggestions = spellChecker.suggestSimilar(wordForSuggestions, suggestionsNumber, 0.8f);

    if (suggestions != null && suggestions.length > 0) {
        for (String word : suggestions) {
            System.out.println("Did you mean: " + word);
        }
    } else {
        System.out.println("No suggestions found for word: " + wordForSuggestions);
    }
} catch (IOException e) {
    e.printStackTrace();
}

suggestSimilar won't provide suggestions which are identical to the input. To quote the source code:
// don't suggest a word for itself, that would be silly
If you want to know whether wordForSuggestions is in the dictionary, use the exist method:
if (spellChecker.exist(wordForSuggestions)) {
    // do what you want for an, apparently, correctly spelled word
}
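Putting the two together, a minimal sketch of the combined flow might look like this (checkWord is a hypothetical helper name; the SpellChecker setup is the same as in the question):
// Hypothetical helper combining exist() and suggestSimilar():
// report an exact dictionary hit first, then fall back to suggestions.
static void checkWord(SpellChecker spellChecker, String input) throws IOException {
    if (spellChecker.exist(input)) {
        System.out.println("'" + input + "' is in the dictionary");
        return;
    }
    String[] suggestions = spellChecker.suggestSimilar(input, 5, 0.8f);
    for (String word : suggestions) {
        System.out.println("Did you mean: " + word);
    }
}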

Related

I have synonym matching working EXCEPT in quoted phrases

Simple synonyms (wordA = wordB) are fine. When there are two or more synonyms (wordA = wordB = wordC ...), then phrase matching only works for the first synonym, unless the phrases have proximity modifiers.
I have a simple test case (it's delivered as an Ant project) which illustrates the problem.
Materials
You can download the test case here: mydemo.with.libs.zip (5MB)
That archive includes the Lucene 9.2 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo.zip (9KB)
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant rnsearch
Input
When indexing the documents, the following synonym list is used (permuted as necessary):
note,notes,notice,notification
subtree,sub tree,sub-tree
I have three documents, each containing a single sentence. The three sentences are:
These release notes describe a document sub tree in a simple way.
This release note describes a document subtree in a simple way.
This release notice describes a document sub-tree in a simple way.
Problem
I believe that any of the following searches should match all three documents:
release note
release notes
release notice
release notification
"release note"
"release notes"
"release notice"
"release notification"
As it happens, the first four searches are fine, but the quoted phrases demonstrate a problem.
The searches for "release note" and "release notes" match all three records, but "release notice" only matches one, and "release notification" does not match any.
However if I change the last two searches like so:
"release notice"~1
"release notification"~2
then all three documents match.
What appears to be happening is that the first synonym is being given the same index position as the term, the second synonym has the position offset by 1, the third offset by 2, etc.
I believe that all the synonyms should be given the same position so that all four phrases match without the need for proximity modifiers at all.
Edit, here's the source of my analyzer:
public class MyAnalyzer extends Analyzer {

    private String synlist;

    public MyAnalyzer(String synlist) {
        this.synlist = synlist;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new LowerCaseFilter(src);
        if (synlist != null) {
            result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }

    private static SynonymMap getSynonyms(String synlist) {
        boolean dedup = Boolean.TRUE;
        SynonymMap synMap = null;
        SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
        int cnt = 0;
        try {
            BufferedReader br = new BufferedReader(new FileReader(synlist));
            String line;
            try {
                while ((line = br.readLine()) != null) {
                    processLine(builder, line);
                    cnt++;
                }
            } catch (IOException e) {
                System.err.println(" caught " + e.getClass() + " while reading synonym list,\n with message " + e.getMessage());
            }
            System.out.println("Synonym load processed " + cnt + " lines");
            br.close();
        } catch (Exception e) {
            System.err.println(" caught " + e.getClass() + " while loading synonym map,\n with message " + e.getMessage());
        }
        if (cnt > 0) {
            try {
                synMap = builder.build();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
        return synMap;
    }

    private static void processLine(SynonymMap.Builder builder, String line) {
        boolean keepOrig = Boolean.TRUE;
        String[] terms = line.split(",");
        if (terms.length < 2) {
            System.err.println("Synonym input must have at least two terms on a line: " + line);
        } else {
            String word = terms[0];
            String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
            addSyns(builder, word, synonymsOfWord, keepOrig);
        }
    }

    private static void addSyns(SynonymMap.Builder builder, String word, String[] syns, boolean keepOrig) {
        CharsRefBuilder synset = new CharsRefBuilder();
        SynonymMap.Builder.join(syns, synset);
        CharsRef wordp = SynonymMap.Builder.join(word.split("\\s+"), new CharsRefBuilder());
        builder.add(wordp, synset.get(), keepOrig);
    }
}
The analyzer includes synonyms when it builds the index, and does not add synonyms when it is used to process a query.
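As an aside, the positions the analyzer actually assigns can be inspected directly with Lucene's TokenStream attribute API; a minimal sketch, assuming the field name "f" and the sample sentence are arbitrary:
// Print each token and its position; a position increment of 0 means
// the token sits at the same position as the previous token.
Analyzer analyzer = new MyAnalyzer("synonyms.list");
try (TokenStream ts = analyzer.tokenStream("f", "This release note describes a document subtree in a simple way.")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    int position = -1;
    while (ts.incrementToken()) {
        position += posIncr.getPositionIncrement();
        System.out.println(position + ": " + term);
    }
    ts.end();
}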
For the "note", "notes", "notice", "notification" list of synonyms:
It is possible to build an index of the above synonyms so that every query listed in the question will find all three documents - including the phrase searches without the need for any ~n proximity searches.
I see there is a separate question for the other list of synonyms "subtree", "sub tree", "sub-tree" - so I will skip those here (I expect the below approach will not work for those, but I would have to take a closer look).
The solution is straightforward, and it's based on a realization that I was (in an earlier question) completely incorrect in an assumption I made about how to build the synonyms:
You can place multiple synonyms of a given word at the same position as the word, when building your indexed data. I incorrectly thought you needed to provide the synonyms as a list - but you can provide them one at a time as words.
Here is the approach:
My analyzer:
Analyzer analyzer = new Analyzer() {
    @Override
    protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream tokenStream = source;
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        // ignoreSynonymCase is a boolean flag defined elsewhere in the class
        tokenStream = new SynonymGraphFilter(tokenStream, getSynonyms(), ignoreSynonymCase);
        tokenStream = new FlattenGraphFilter(tokenStream);
        return new Analyzer.TokenStreamComponents(source, tokenStream);
    }
};
The getSynonyms() method used by the above analyzer, using the note,notes,notice,notification list:
private SynonymMap getSynonyms() {
    // de-duplicate rules when loading:
    boolean dedup = Boolean.TRUE;
    // include original word in index:
    boolean includeOrig = Boolean.TRUE;
    String[] synonyms = {"note", "notes", "notice", "notification"};
    // build a synonym map where every word in the list is a synonym
    // of every other word in the list:
    SynonymMap.Builder synMapBuilder = new SynonymMap.Builder(dedup);
    for (String word : synonyms) {
        for (String synonym : synonyms) {
            if (!synonym.equals(word)) {
                synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig);
            }
        }
    }
    SynonymMap synonymMap = null;
    try {
        synonymMap = synMapBuilder.build();
    } catch (IOException ex) {
        System.err.print(ex);
    }
    return synonymMap;
}
I looked at the indexed data by using org.apache.lucene.codecs.simpletext.SimpleTextCodec, to generate human-readable indexes (just for testing purposes):
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
This allowed me to see where the synonyms were inserted into the indexed data. So, for example, taking the word note, we see the following indexed entries:
term note
  doc 0
    freq 1
    pos 2
  doc 1
    freq 1
    pos 2
  doc 2
    freq 1
    pos 2
So, that tells us that all three documents contain note at token position 2 (the 3rd word).
And for notification we see exactly the same data:
term notification
  doc 0
    freq 1
    pos 2
  doc 1
    freq 1
    pos 2
  doc 2
    freq 1
    pos 2
We see this for all the words in the synonym list, which is why all 8 queries return all 3 documents.
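To double-check at the query level, a short phrase-query sketch (the index path and the field name "content" are assumptions):
// Verify that the quoted phrase matches once the synonyms share positions.
try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
    IndexSearcher searcher = new IndexSearcher(reader);
    PhraseQuery query = new PhraseQuery("content", "release", "notification");
    TopDocs hits = searcher.search(query, 10);
    System.out.println("\"release notification\" matched " + hits.totalHits.value + " document(s)");
}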

Multiple synonym matching is not working the way I intend or expect, what am I doing wrong?

I am indexing technical documentation and incorporating synonyms at index time, so that users can search with a number of alternative patterns. But only some synonyms seem to be getting into the map.
I have a text file synonyms.list which contains a series of lines like so:
note,notes,notice
subtree,sub-tree,sub tree
My analyzer and synonym map builder (I've removed try and catch wrappers to save space, but they aren't the problem):
public class TechAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
        result = new SynonymGraphFilter(result, getSynonyms(getSynonymsList()), Boolean.TRUE);
        result = new FlattenGraphFilter(result);
        return new TokenStreamComponents(src, result);
    }

    private static SynonymMap getSynonyms(String synlist) {
        boolean dedup = Boolean.TRUE;
        SynonymMap synMap = null;
        SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
        int cnt = 0;
        BufferedReader br = new BufferedReader(new FileReader(synlist));
        String line;
        while ((line = br.readLine()) != null) {
            processLine(builder, line);
            cnt++;
        }
        br.close();
        if (cnt > 0) {
            synMap = builder.build();
        }
        return synMap;
    }

    private static void processLine(SynonymMap.Builder builder, String line) {
        boolean keepOrig = Boolean.TRUE;
        String[] terms = line.split(",");
        if (terms.length > 1) {
            String word = terms[0];
            String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
            for (String syn : synonymsOfWord) {
                addPair(builder, word, syn, keepOrig);
            }
        }
    }

    private static void addPair(SynonymMap.Builder builder, String word, String syn, boolean keepOrig) {
        CharsRef synp = SynonymMap.Builder.join(syn.split("\\s+"), new CharsRefBuilder());
        CharsRef wordp = new CharsRef(word);
        builder.add(wordp, synp, keepOrig);
        // builder.add(synp, wordp, keepOrig); // ? do I need this??
    }
}
I'm not splitting word in addPair() because (at the moment, anyway) the first term in every line of synonyms.list must be a word not a phrase.
My first question relates to that comment at the bottom of addPair(): if I am adding (word,synonym) to the map, do I also need to add (synonym,word)? Or is the map commutative? I can't tell, because of the problem I'm having which is the basis of the next question.
So... the technical documentation being indexed contains some documents which refer to "release notes", and some which refer to "release notices". There are also points described as a "release note". So I would like a search for any of "release note", "release notes", or "release notice" to match all three alternatives.
My code doesn't seem to enable this. If I index a single file which refers to "release notes" I can inspect the generated index with luke and I can see that the index only ever contains one synonym, not two. The same position in the index might have "note" and "notes", or "notes" and "notice", depending on the order of the words in the synonyms.list text file, but it will never have "note", "notes" and "notice".
Obviously I'm not building the map correctly, but the documentation hasn't helped me see what I am doing wrong.
If you've read this far, and can see the flaw in my code, please help me see it too!
Thanks, etc.

How to delete a paragraph using XWPF - Apache POI

I am trying to delete a paragraph from a .docx document I have generated using Apache POI XWPF. I can do it easily with a .doc Word document using HWPF, as below:
for (String paraCount : plcHoldrPargrafDletdLst) {
    Paragraph ph = doc.getRange().getParagraph(Integer.parseInt(paraCount));
    System.out.println("Deleted Paragraph Start & End: " + ph.getStartOffset() + " & " + ph.getEndOffset());
    System.out.println("Deleted Paragraph Text: " + ph.text());
    ph.delete();
}
I tried to do the same with
doc.removeBodyElement(Integer.parseInt(paraCount));
but unfortunately was not successful: in the resulting document, I cannot see the paragraph deleted.
Any suggestions on how to accomplish similar functionality in XWPF?
Ok, this question is a bit old and might not be required anymore, but I just found a different solution than the suggested one.
Hope the following code will help somebody with the same issue
...
FileInputStream fis = new FileInputStream(fileName);
XWPFDocument doc = new XWPFDocument(fis);
fis.close();

// Find a paragraph with "todelete" text inside
XWPFParagraph toDelete = doc.getParagraphs().stream()
        .filter(p -> StringUtils.equalsIgnoreCase("todelete", p.getParagraphText()))
        .findFirst().orElse(null);

if (toDelete != null) {
    doc.removeBodyElement(doc.getPosOfParagraph(toDelete));
    OutputStream fos = new FileOutputStream(fileName);
    doc.write(fos);
    fos.close();
}
It seems like you're really unable to remove paragraphs from a .docx file.
What you should be able to do is remove the content of paragraphs, the so-called Runs. You could try this one:
List<XWPFParagraph> paragraphs = doc.getParagraphs();
for (XWPFParagraph paragraph : paragraphs)
{
    // remove from the end, so removal does not shift the indexes
    // of the runs that have not been visited yet
    for (int i = paragraph.getRuns().size() - 1; i >= 0; i--)
    {
        paragraph.removeRun(i);
    }
}
You can also specify which Run of which Paragraph should be removed, e.g.
paragraphs.get(23).getRuns().remove(17);
// Remove all existing runs
removeRun(para, 0);

public static void removeRun(XWPFParagraph para, int depth)
{
    if (depth > 10)
    {
        return;
    }
    int numberOfRuns = para.getRuns().size();
    // Remove all existing runs, from the last to the first
    for (int i = 0; i < numberOfRuns; i++)
    {
        try
        {
            para.removeRun(numberOfRuns - i - 1);
        }
        catch (Exception e)
        {
            //e.printStackTrace();
        }
    }
    // If any runs are left, retry (up to the depth limit)
    if (para.getRuns().size() > 0)
    {
        removeRun(para, ++depth);
    }
}
I like Apache POI, and for the most part it's great; however, I have found the documentation a little scatty, to say the least.
The elusive way of deleting a paragraph I found to be quite a nightmare, giving me the following exception when trying to remove a paragraph:
java.util.ConcurrentModificationException
As mentioned in Ugo Delle Donne's example, I solved this by first recording the paragraphs that I wanted to delete, and then using the removeBodyElement method of the document.
e.g.
List<XWPFParagraph> record = new ArrayList<XWPFParagraph>();
for (XWPFParagraph p : doc.getParagraphs()) {
    // Rebuild the paragraph text from its runs.
    // I saw so many examples as r.getText(pos), don't use that.
    String text = "";
    for (XWPFRun r : p.getRuns()) {
        text += r.text();
    }
    // Find some unique text in the paragraph
    if (text != null && text.contains("SOME-UNIQUE-TEXT")) {
        // Save the paragraph to delete for later
        record.add(p);
    }
}

// Now delete the paragraphs and anything within them.
for (int i = 0; i < record.size(); i++) {
    // Remove the paragraph and everything within it
    doc.removeBodyElement(doc.getPosOfParagraph(record.get(i)));
}
// Shaaazam, I hope this helps !
I believe your question was answered in this question.
When you are inside of a table you need to use the functions of the XWPFTableCell instead of the XWPFDocument:
cell.removeParagraph(cell.getParagraphs().indexOf(para));
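For example, a hedged sketch of walking a document's tables to remove marker paragraphs from cells (the marker text "todelete" is an assumption for illustration):
// Remove any paragraph inside a table cell whose text contains the marker.
for (XWPFTable table : doc.getTables()) {
    for (XWPFTableRow row : table.getRows()) {
        for (XWPFTableCell cell : row.getTableCells()) {
            // copy the paragraph list, since we modify the cell while iterating
            for (XWPFParagraph para : new ArrayList<>(cell.getParagraphs())) {
                if (para.getParagraphText().contains("todelete")) {
                    cell.removeParagraph(cell.getParagraphs().indexOf(para));
                }
            }
        }
    }
}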

Lucene 4.0 sample code

I can't get this to work with Lucene 4.0 and its new features... Could somebody please help me?
I have crawled a bunch of HTML documents from the web. Now I would like to count the number of distinct words in every document.
This is how I did it with Lucene 3.5 (for a single document; to process them all, I loop over all documents, each time with a new RAMDirectory containing only one doc):
Analyzer analyzer = ...; // some Lucene Analyzer
RAMDirectory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);

// get somehow the String containing a certain text:
String _words = doc.getPageDescription();

try {
    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, _words);
    w.close();
} catch (IOException e) {
    e.printStackTrace();
} catch (Exception e) {
    e.printStackTrace();
}

try {
    // System.out.print(", count Terms... ");
    IndexReader reader = IndexReader.open(index);
    TermFreqVector[] freqVector = reader.getTermFreqVectors(0);
    if (freqVector == null) {
        System.out.println("Count words: 0");
    }
    for (TermFreqVector vector : freqVector) {
        String[] terms = vector.getTerms();
        int[] freq = vector.getTermFrequencies();
        int n = terms.length;
        System.out.println("Count words: " + n);
        ....
How can I do this with Lucene 4.0?
However, I'd prefer to do this using an FSDirectory instead of a RAMDirectory; I guess that would be more performant given a rather high number of documents?
Thanks and regards
C.
Use the Fields/Terms APIs.
See especially the example 'access term vector fields for a specific document'.
Seeing as you are looping over all documents, if your end goal is really something like the average number of unique terms across all documents, keep reading down to the 'index statistics' section. For example, in that case you can compute it efficiently with #postings / #documents: getSumDocFreq()/maxDoc()
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/package-summary.html#package_description
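For example, a minimal Lucene 4.x sketch of both ideas (the index path and the field name "content" are assumptions; the field must have been indexed with term vectors enabled):
// Count distinct terms in one document via its term vector.
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("index")));
Terms vector = reader.getTermVector(0, "content"); // doc 0, field "content"
if (vector == null) {
    System.out.println("Count words: 0");
} else {
    System.out.println("Count words: " + vector.size()); // number of distinct terms
    TermsEnum termsEnum = vector.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        System.out.println(term.utf8ToString() + " x " + termsEnum.totalTermFreq());
    }
}

// Index-level statistic: average number of postings per document.
Terms fieldTerms = MultiFields.getTerms(reader, "content");
if (fieldTerms != null) {
    System.out.println("Average distinct terms per doc: " + (double) fieldTerms.getSumDocFreq() / reader.maxDoc());
}
reader.close();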

Unable to update a list item from a workflow task in C#

I am not getting any exceptions, but the code below is simply not working. Any ideas?
SPSecurity.RunWithElevatedPrivileges(delegate() {
    using (SPWeb web = this.workflowProperties.Web) {
        try {
            SPListItem item = web.Lists["NewHireFormsLibrary"].Items[workflowProperties.ItemId - 1];
            item["Field 1"] = "Gotcha!!!";
            item.Update();
            LogHistory("Information", "Workflow indexing complete. " + item["Field 1"], "");
        }
        catch (Exception ex) {
            LogHistory("Error", ex.Message, ex.StackTrace);
        }
    }
});
It looks like you are not referencing the field by its internal name, which is how you have to reference fields when accessing them through the SPListItem indexer. Try something like
item["Field_x0020_1"] = "Gotcha!!!";
and it should work. Note that internal names never contain spaces; any space in the display name is replaced by its hex character string, as shown above.