Using ShingleFilter to build a customized analyzer in PyLucene

I am pretty new to Lucene and PyLucene. I ran into a problem while using PyLucene to write a customized analyzer that tokenizes text into bigrams.
The code for the analyzer class is:
class BiGramShingleAnalyzer(PythonAnalyzer):
    def __init__(self, outputUnigrams=False):
        PythonAnalyzer.__init__(self)
        self.outputUnigrams = outputUnigrams

    def tokenStream(self, field, reader):
        result = ShingleFilter(LowerCaseTokenizer(Version.LUCENE_35, reader))
        result.setOutputUnigrams(self.outputUnigrams)
        #print 'result is', result
        return result
I used ShingleFilter on the TokenStream produced by LowerCaseTokenizer. When I call the tokenStream function directly, it works just fine:
str = 'divide this sentence'
bi = BiGramShingleAnalyzer(False)
sf = bi.tokenStream('f', StringReader(str))
while sf.incrementToken():
    print sf
(divide this,startOffset=0,endOffset=11,positionIncrement=1,type=shingle)
(this sentence,startOffset=7,endOffset=20,positionIncrement=1,type=shingle)
But when I tried to build a query parser using this analyzer, a problem occurred:
parser = QueryParser(Version.LUCENE_35, 'f', bi)
query = parser.parse(str)
The resulting query is empty.
After I added a print statement to the tokenStream function, I found that when I call parser.parse(str), tokenStream actually gets called three times (once per word in my str variable). It seems that the parser pre-processes the string I pass to it and calls tokenStream on each piece of the pre-processed result.
Any thoughts on how I should make the analyzer work, so that when I pass it to the query parser, the parser parses a string into bigrams?
Thanks in advance!
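For reference, the classic QueryParser grammar splits the query text on whitespace before handing each chunk to the analyzer, which matches the three calls observed above; only a quoted phrase reaches the analyzer as a whole. A minimal sketch in Java terms (PyLucene mirrors this API), assuming the bigram analyzer from the question:

QueryParser parser = new QueryParser(Version.LUCENE_35, "f", analyzer);
// each word is analyzed separately, so the ShingleFilter never sees a word pair
Query unquoted = parser.parse("divide this sentence");
// the quoted phrase is analyzed as a whole, so bigram shingles can be produced
Query quoted = parser.parse("\"divide this sentence\"");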

Related

Lucene Porter Stemmer - get original unstemmed word

I have worked out how to use Lucene's Porter Stemmer but would like to also retrieve the original, un-stemmed word. So, to this end, I added a CharTermAttribute to the TokenStream before creating the PorterStemFilter, as follows:
Analyzer analyzer = new StandardAnalyzer();
TokenStream original = analyzer.tokenStream("StandardTokenStream", new StringReader(inputText));
TokenStream stemmed = new PorterStemFilter(original);
CharTermAttribute originalWordAttribute = original.addAttribute(CharTermAttribute.class);
CharTermAttribute stemmedWordAttribute = stemmed.addAttribute(CharTermAttribute.class);
stemmed.reset();
while (stemmed.incrementToken()) {
    System.out.println(stemmedWordAttribute + " " + originalWordAttribute);
}
Unfortunately, both attributes return the stemmed word.
Is there a way to get the original word as well?
Lucene's PorterStemFilter can be combined with Lucene's KeywordRepeatFilter: the repeat filter emits each token twice, once marked as a keyword and once not, and the stemmer leaves the keyword-marked copy untouched, so the stream ends up containing both the unstemmed and the stemmed token.
Modifying your approach:
Analyzer analyzer = new StandardAnalyzer();
TokenStream original = analyzer.tokenStream("StandardTokenStream", new StringReader(inputText));
TokenStream repeated = new KeywordRepeatFilter(original);
TokenStream stemmed = new PorterStemFilter(repeated);
CharTermAttribute stemmedWordAttribute = stemmed.addAttribute(CharTermAttribute.class);
stemmed.reset();
while (stemmed.incrementToken()) {
    String originalWord = stemmedWordAttribute.toString();
    stemmed.incrementToken();
    String stemmedWord = stemmedWordAttribute.toString();
    System.out.println(originalWord + " " + stemmedWord);
}
This is fairly crude, but shows the approach.
Example input:
testing giraffe book passing
Resulting output:
testing test
giraffe giraff
book book
passing pass
For each pair of tokens, if the second matches the first (book book), then there was no stemming.
Normally, you would use this with RemoveDuplicatesTokenFilter to remove the duplicate book term - but if you do that I think it becomes much harder to track the stemmed/unstemmed pairs - so for your specific scenario, I did not use that de-duplication filter.
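A slightly more robust variant is to check the KeywordAttribute that KeywordRepeatFilter sets on the unstemmed copy, instead of relying on the tokens arriving in strict pairs. Just a sketch of the idea:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

Analyzer analyzer = new StandardAnalyzer();
TokenStream stemmed = new PorterStemFilter(new KeywordRepeatFilter(
        analyzer.tokenStream("f", new StringReader("testing giraffe book passing"))));
CharTermAttribute term = stemmed.addAttribute(CharTermAttribute.class);
KeywordAttribute keyword = stemmed.addAttribute(KeywordAttribute.class);
stemmed.reset();
String original = null;
while (stemmed.incrementToken()) {
    if (keyword.isKeyword()) {
        // KeywordRepeatFilter marks the repeated (unstemmed) copy as a keyword,
        // which is also why PorterStemFilter leaves it alone
        original = term.toString();
    } else {
        System.out.println(original + " " + term.toString());
    }
}
stemmed.end();
stemmed.close();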

Converting gremlin query from gremlin console to Bytecode

I am trying to convert a Gremlin query received from the Gremlin Console to bytecode in order to extract the StepInstructions. I am using the code below to do that, but it looks hacky and ugly to me. Is there a better way of converting a Gremlin query from the Gremlin Console to Bytecode?
String query = (String) requestMessage.getArgs().get(Tokens.ARGS_GREMLIN);
final GremlinGroovyScriptEngine engine = new GremlinGroovyScriptEngine();
CompiledScript compiledScript = engine.compile(query);
final Graph graph = EmptyGraph.instance();
final GraphTraversalSource g = graph.traversal();
final Bindings bindings = engine.createBindings();
bindings.put("g", g);
DefaultGraphTraversal graphTraversal = (DefaultGraphTraversal) compiledScript.eval(bindings);
Bytecode bytecode = graphTraversal.getBytecode();
If you need to take a Gremlin string and convert it to Bytecode, I don't think there is a much better way to do it. You must pass the string through a GremlinGroovyScriptEngine to evaluate it into an actual Traversal object that you can manipulate. The only improvement I can think of would be to call eval() more directly:
// construct all of this once and re-use it for your application
final GremlinGroovyScriptEngine engine = new GremlinGroovyScriptEngine();
final Graph graph = EmptyGraph.instance();
final GraphTraversalSource g = graph.traversal();
final Bindings bindings = engine.createBindings();
bindings.put("g", g);
//////////////
String query = (String) requestMessage.getArgs().get(Tokens.ARGS_GREMLIN);
DefaultGraphTraversal graphTraversal = (DefaultGraphTraversal) engine.eval(query, bindings);
Bytecode bytecode = graphTraversal.getBytecode();
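Once you have the Bytecode, pulling out the step instructions is straightforward; a minimal sketch using the TinkerPop Bytecode API against the bytecode variable above:

// iterate the instructions that make up the traversal
for (Bytecode.Instruction instruction : bytecode.getStepInstructions()) {
    System.out.println(instruction.getOperator() + " "
            + java.util.Arrays.toString(instruction.getArguments()));
}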

What is an alternative for Lucene Query's extractTerms?

In Lucene 4.6.0 there was the method extractTerms, which provided the extraction of terms from a query (Query 4.6.0). However, as of Lucene 6.2.1 it no longer exists (Query Lucene 6.2.1). Is there a valid alternative for it?
What I need is to extract the terms (and corresponding fields) of a Query built by QueryParser.
Maybe not the best answer but one way is to use the same analyzer and tokenize the query string:
Analyzer anal = new StandardAnalyzer();
TokenStream ts = anal.tokenStream("title", query); // string query
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(termAtt.toString());
}
anal.close();
I have temporarily solved my problem with the following code. Smarter alternatives will be well accepted:
QueryParser qp = new QueryParser("title", a);
Query q = qp.parse(query);
Set<Term> termQuerySet = new HashSet<Term>();
Weight w = searcher.createWeight(q, true, 3.4f);
w.extractTerms(termQuerySet);
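For what it's worth, in later Lucene versions (8.x and up) the idiomatic replacement is Query.visit with a QueryVisitor; a sketch assuming a Lucene 8+ dependency, reusing the q parsed above:

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.QueryVisitor;

// collect every term the query references, together with its field
Set<Term> terms = new HashSet<>();
q.visit(QueryVisitor.termCollector(terms));
for (Term t : terms) {
    System.out.println(t.field() + ":" + t.text());
}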

Why doesn't my SynonymGraphFilter test in Lucene work?

I am trying to test SynonymGraphFilter, but it doesn't work as I expected and doesn't return the correct answer.
These are the constructor and the custom createComponents method of my custom analyzer:
public SuggestAnalizer(SynonymMap synonymMap) {
    this.synonymMap = synonymMap;
    this.stopList = Collections.emptyList();
}

@Override
protected TokenStreamComponents createComponents(String s) {
    Tokenizer tokenizer = new StandardTokenizer();
    TokenStream tokenStream = new SynonymGraphFilter(tokenizer, synonymMap, true);
    tokenStream = new FlattenGraphFilter(tokenStream);
    return new TokenStreamComponents(tokenizer, tokenStream);
}
This is the test code:
String entrada = "ALCALDE KOOPER";
String salida = "FEDERICO COOPER";
SynonymMap.Builder builder = new SynonymMap.Builder(true);
CharsRef input = SynonymMap.Builder.join(entrada.split(" "), new CharsRefBuilder());
CharsRef output = SynonymMap.Builder.join(salida.split(" "), new CharsRefBuilder());
builder.add(output, input, true);
suggestAnalizer = new SuggestAnalizer(builder.build());
TokenStream tokenStream = suggestAnalizer.tokenStream("field", entrada2);
assertTokenStreamContents(tokenStream, new String[]{
    "FEDERICO"
});
assertAnalyzesTo(suggestAnalizer, entrada, new String[]{
    "FEDERICO"
});
I expected the assertions to pass, with the "ALCALDE KOOPER" string replaced by its synonym "FEDERICO COOPER", but this doesn't happen.
Does someone know where my error is, or why my code doesn't work?
The reason for this behaviour is that you add a multiword synonym from FEDERICO COOPER to ALCALDE KOOPER (in the code, builder.add(output, input, true) maps from output, which is FEDERICO COOPER, to input, which is ALCALDE KOOPER).
Later you test the synonyms for FEDERICO, but there is no mapping from it, which is why you get an empty response and an assertion error. So you need to add the synonym in the other direction, from ALCALDE KOOPER to FEDERICO COOPER.
But even if you do that, there is a mistake in how the SynonymMap is built: you used the ignoreCase parameter with the value true, which means (quoting the Javadoc):
case-folds input for matching with Character#toLowerCase(int). Note, if you set this to true, it's your responsibility to lowercase the input entries when you create the SynonymMap.
So you either need to use a lowercased version in testing, or set ignoreCase to false.
You could check a reference code here
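Putting both fixes together, a sketch of how the map might be built (lowercased entries, mapping from the surface form to the synonym to emit; SuggestAnalizer as in the question):

SynonymMap.Builder builder = new SynonymMap.Builder(true);
// with ignoreCase=true, the map entries themselves must already be lowercase
CharsRef input = SynonymMap.Builder.join("alcalde kooper".split(" "), new CharsRefBuilder());
CharsRef output = SynonymMap.Builder.join("federico cooper".split(" "), new CharsRefBuilder());
// map from the text we expect to analyze (input) to the synonym we want emitted (output)
builder.add(input, output, true);
suggestAnalizer = new SuggestAnalizer(builder.build());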

Lucene MultiFieldQuery with WildcardQuery

Currently I have an issue with the Lucene search (version 2.9).
I have a search term and I need to apply it to several fields, so I have to use MultiFieldQueryParser. On the other hand, I also have to use WildcardQuery, because our customer wants to search for a term inside a phrase (e.g. "CMH" should match "KRC250/CMH/830/T/H").
I have tried to replace the slashes ('/') with stars ('*') and use a BooleanQuery with enclosed stars for the term.
Unfortunately, without any success.
Does anyone have an idea?
Yes, if the field shown is a single token, setting setAllowLeadingWildcard to true would be necessary, like:
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*CMH*");
However:
You don't mention how the field is analyzed. By default, the StandardAnalyzer is used, which will split it into tokens at slashes (or asterisks, when indexing data). If you are using this sort of analysis, you could simply create a TermQuery searching for "cmh" (StandardAnalyzer includes a LowercaseFilter), or simply:
String[] fields = {"this", "that", "another"};
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer); // assuming StandardAnalyzer
Query simpleQuery = parser.parse("CMH");
// Or even...
Query slightlyMoreComplexQuery = parser.parse("\"CMH/830/T\"");
I don't understand what you mean by a BooleanQuery with enclosed stars, if you can include code to elucidate that, it might help.
Sorry, maybe I have described it a little bit wrong.
I took something like this:
BooleanQuery bq = new BooleanQuery();
foreach (string field in fields)
{
    foreach (string tok in tokArr)
    {
        bq.Add(new WildcardQuery(new Term(field, " *" + tok + "* ")), BooleanClause.Occur.SHOULD);
    }
}
return bq;
but unfortunately it did not work.
I have modified it like this:
string newterm = string.Empty;
string[] tok = term.Split(new[] { ' ', '/' }, StringSplitOptions.RemoveEmptyEntries);
tok.ForEach(x => newterm += x.EnsureStartsWith(" *").EnsureEndsWith("* "));
var version = Lucene.Net.Util.Version.LUCENE_29;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, fields, analyzer);
parser.SetDefaultOperator(QueryParser.Operator.AND);
parser.SetAllowLeadingWildcard(true);
return parser.Parse(newterm);
and my customer loves it :-)