Word normalization using RDD - Lucene

Maybe this question is a little bit strange... but I'll try to ask it.
Everyone who has written applications using the Lucene API has seen something like this:
public static String removeStopWordsAndGetNorm(String text, String[] stopWords, Normalizer normalizer) throws IOException
{
    TokenStream tokenStream = new ClassicTokenizer(Version.LUCENE_44, new StringReader(text));
    tokenStream = new StopFilter(Version.LUCENE_44, tokenStream, StopFilter.makeStopSet(Version.LUCENE_44, stopWords, true));
    tokenStream = new LowerCaseFilter(Version.LUCENE_44, tokenStream);
    tokenStream = new StandardFilter(Version.LUCENE_44, tokenStream);
    tokenStream.reset();
    String result = "";
    while (tokenStream.incrementToken())
    {
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        try
        {
            // normalizer.getNormalForm(...) - stemmer or lemmatizer
            result += normalizer.getNormalForm(token.toString()) + " ";
        }
        catch (Exception e)
        {
            // if something went wrong
        }
    }
    return result;
}
Is it possible to rewrite this word normalization using an RDD?
Does anyone have an example of this transformation, or can point me to a web resource about it?
Thank you.

I recently used a similar example for a talk. It shows how to remove the stop words. There is no normalization phase, but if that normalizer.getNormalForm comes from a library that can be reused, it should be easy to integrate.
This code could be a starting point:
// source text
val rdd = sc.textFile(...)
// stop words src
val stopWordsRdd = sc.textFile(...)
// bring stop words to the driver to broadcast => more efficient than rdd.subtract(stopWordsRdd)
val stopWords = stopWordsRdd.collect.toSet
val stopWordsBroadcast = sc.broadcast(stopWords)
val words = rdd.flatMap(line => line.split("\\W").map(_.toLowerCase))
val cleaned = words.mapPartitions { iterator =>
  val stopWordsSet = stopWordsBroadcast.value
  iterator.filter(elem => !stopWordsSet.contains(elem))
}
// plug the normalizer function here
val normalized = cleaned.map(normalForm(_))
Note: This is from the Spark job point of view. I'm not familiar with Lucene.
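For completeness, the same pipeline can also be expressed with Spark's Java API, keeping the Lucene analysis chain from the question as a plain helper. This is only a sketch, assuming Lucene 4.4 (as in the question) and Java 8 lambdas; normalizeLine is a stand-in for the normalizer.getNormalForm step:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class RddNormalization {

    // One raw line in, one cleaned line out (stop words removed, lowercased).
    // Creating the TokenStream per call keeps the function self-contained;
    // for speed, call it from mapPartitions so heavier objects can be reused.
    public static String normalizeLine(String text, String[] stopWords) throws IOException {
        TokenStream ts = new ClassicTokenizer(Version.LUCENE_44, new StringReader(text));
        ts = new StopFilter(Version.LUCENE_44, ts, StopFilter.makeStopSet(Version.LUCENE_44, stopWords, true));
        ts = new LowerCaseFilter(Version.LUCENE_44, ts);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        StringBuilder sb = new StringBuilder();
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(term.toString()).append(' '); // plug a stemmer/lemmatizer here
        }
        ts.end();
        ts.close();
        return sb.toString().trim();
    }

    // Broadcast the stop words once, then normalize every line of the input.
    public static JavaRDD<String> normalize(JavaSparkContext sc, String inputPath, String[] stopWords) {
        Broadcast<String[]> stopWordsBc = sc.broadcast(stopWords);
        return sc.textFile(inputPath).map(line -> normalizeLine(line, stopWordsBc.value()));
    }
}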

Related

Searching for product codes, phone numbers in Lucene

I'm looking for general advice on how to search for identifiers, product codes or phone numbers in Apache Lucene 8.x. Let's say I'm trying to search lists of product codes (like an ISBN, for example 978-3-86680-192-9). If somebody enters 9783 or 978 3 or 978-3, then 978-3-86680-192-9 should appear. The same should happen if an identifier uses any combination of letters, spaces, digits and punctuation (examples: TS 123, 123.abc). How would I do this?
I thought I could solve this with a custom analyzer that removes all the punctuation and whitespace, but the results are mixed:
public class IdentifierAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KeywordTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(tokenizer);
        tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
        tokenStream = new TrimFilter(tokenStream);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream tokenStream = new LowerCaseFilter(in);
        tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
        tokenStream = new TrimFilter(tokenStream);
        return tokenStream;
    }
}
So while I get the desired results when performing a PrefixQuery with TS1*, TS 1* (with whitespace) does not yield satisfactory results. When I look into the parsed query, I see that Lucene splits TS 1* into two queries: myField:TS myField:1*. WordDelimiterGraphFilter looks interesting, but I couldn't figure out how to apply it here.
This is not a comprehensive answer - but I agree that WordDelimiterGraphFilter may be helpful for this type of data. However, there could still be test cases which need additional handling.
Here is my custom analyzer, using a WordDelimiterGraphFilter:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilterFactory;

import java.util.Map;
import java.util.HashMap;

public class IdentifierAnalyzer extends Analyzer {

    private WordDelimiterGraphFilterFactory getWordDelimiter() {
        Map<String, String> settings = new HashMap<>();
        settings.put("generateWordParts", "1");   // e.g. "PowerShot" => "Power" "Shot"
        settings.put("generateNumberParts", "1"); // e.g. "500-42" => "500" "42"
        settings.put("catenateAll", "1");         // e.g. "wi-fi" => "wifi" and "500-42" => "50042"
        settings.put("preserveOriginal", "1");    // e.g. "500-42" => "500" "42" "500-42"
        settings.put("splitOnCaseChange", "1");   // e.g. "fooBar" => "foo" "Bar"
        return new WordDelimiterGraphFilterFactory(settings);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KeywordTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(tokenizer);
        tokenStream = getWordDelimiter().create(tokenStream);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream tokenStream = new LowerCaseFilter(in);
        return tokenStream;
    }
}
It uses the WordDelimiterGraphFilterFactory helper, together with a map of parameters, to control which settings are applied.
You can see the complete list of available settings in the WordDelimiterGraphFilterFactory JavaDoc. You may want to experiment with setting/unsetting different ones.
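To see exactly which tokens a given combination of settings produces, it helps to dump what the analyzer emits for a sample value. The printTokens helper below is not part of the original answer, just a small sketch using the standard TokenStream API against the IdentifierAnalyzer above:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDebugger {

    // Prints every token the analyzer emits for the given text, so the effect
    // of toggling WordDelimiterGraphFilterFactory settings is easy to inspect.
    public static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("identifiers", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}

Calling printTokens(new IdentifierAnalyzer(), "TS 123") prints each indexed variant of that value on its own line.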
Here is a test index builder for the following 3 input values:
978-3-86680-192-9
TS 123
123.abc
public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
    final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
    Analyzer analyzer = new IdentifierAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    Document doc;
    List<String> identifiers = Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc");

    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        for (String identifier : identifiers) {
            doc = new Document();
            doc.add(new TextField("identifiers", identifier, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
With these settings, each identifier is indexed as its original (lowercased) value, its fully catenated form, and its individual word and number parts - for example, "TS 123" ends up indexed roughly as ts 123, ts123, ts and 123.
For querying the above indexed data I used this:
public static void doSearch() throws IOException, ParseException {
    Analyzer analyzer = new IdentifierAnalyzer();
    QueryParser parser = new QueryParser("identifiers", analyzer);
    List<String> searches = Arrays.asList("9783", "9783*", "978 3", "978-3", "TS1*", "TS 1*");

    for (String search : searches) {
        Query query = parser.parse(search);
        printHits(query, search);
    }
}

private static void printHits(Query query, String search) throws IOException {
    System.out.println("search term: " + search);
    System.out.println("parsed query: " + query.toString());
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs results = searcher.search(query, 100);
    ScoreDoc[] hits = results.scoreDocs;
    System.out.println("hits: " + hits.length);
    for (ScoreDoc hit : hits) {
        System.out.println("");
        System.out.println("  doc id: " + hit.doc + "; score: " + hit.score);
        Document doc = searcher.doc(hit.doc);
        System.out.println("  identifier: " + doc.get("identifiers"));
    }
    System.out.println("-----------------------------------------");
}
This uses the following search terms - all of which I pass into the classic query parser (though you could, of course, use more sophisticated query types via the API):
9783
9783*
978 3
978-3
TS1*
TS 1*
The only query which failed to find any matching documents was the first one:
search term: 9783
parsed query: identifiers:9783
hits: 0
This should not be a surprise, since this is a partial token, without a wildcard. The second query (with the wildcard added) found one document, as expected.
The final query I tested, TS 1*, found three hits - but the one we want has the best matching score:
search term: TS 1*
parsed query: identifiers:ts identifiers:1*
hits: 3
doc id: 1; score: 1.590861
identifier: TS 123
doc id: 0; score: 1.0
identifier: 978-3-86680-192-9
doc id: 2; score: 1.0
identifier: 123.abc
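As a side note on the "more sophisticated query types via the API" mentioned above: if a bare input like 9783 should behave like a prefix search without the user typing a *, one option (my own addition, not part of the original answer) is to build a programmatic PrefixQuery instead of going through the classic parser:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// Equivalent to parsing the query string "9783*".
Query prefixQuery = new PrefixQuery(new Term("identifiers", "9783"));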

Base64 encoder in Kotlin

Can you help me with refactoring this method from Java to Kotlin, please? I have a problem with the ByteBuffer in the return statement. It looks like it doesn't work that way in Kotlin.
public String encryptData(Object payload, Key secretKey)
        throws NoSuchAlgorithmException, NoSuchPaddingException, JsonProcessingException, BadPaddingException,
        IllegalBlockSizeException, InvalidAlgorithmParameterException, InvalidKeyException {
    var initialVector = new byte[INITIAL_VECTOR_SIZE];
    secureRandom.nextBytes(initialVector);
    var cipher = Cipher.getInstance(AES_TRANSFORMATION);
    cipher.init(Cipher.ENCRYPT_MODE, secretKey, new IvParameterSpec(initialVector));
    var data = cipher.doFinal(objectMapper.writeValueAsString(payload).getBytes());
    var byteBuffer = ByteBuffer.allocate(initialVector.length + data.length);
    return new String(
            Base64.getEncoder()
                    .encode(byteBuffer
                            .put(0, initialVector)
                            .put(INITIAL_VECTOR_SIZE, data))
                    .array());
}
It's cleaner to set up your byteBuffer outside of the Base64 conversion. Once you have done that, you can return the encoded backing array directly (encodeToString takes a byte[], not a ByteBuffer):
...
return Base64.getEncoder().encodeToString(byteBuffer.array())

Lucene, relevance/scoring for an in-memory string

I am building a bot that monitors HN for topics that I am interested in.
I'd like to analyze an in-memory string, and determine if it contains some keywords that I am interested in.
I'd like it to take into consideration the things that Lucene does when performing a standard query (word stemming, stop words, normalizing punctuation, etc).
I could probably build an in-memory index, and query it using the normal approach, but is there a way that I can use the internals of Lucene to avoid a needless index being built?
Bonus points if I can get a relevance value (0.0-1.0), instead of just a true/false value.
Pseudo code:
public static decimal IsRelevant(string keywords, string input)
{
    // Does the "input" variable look like it contains "keywords"?
}

IsRelevant("books", "I just bought a book, and I like it."); // matching!
IsRelevant("book", "I just bought many books!");             // matching!
I created a solution using an in-memory search index. It's not ideal, but it does the task.
public static float RelevanceScore(string keyword, string input)
{
    var directory = new RAMDirectory();
    var analyzer = new EnglishAnalyzer(LuceneVersion.LUCENE_48);

    using (var writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
    {
        var doc = new Document();
        doc.Add(new Field("input", input, Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
        writer.Commit();
    }

    using (var reader = IndexReader.Open(directory))
    {
        var searcher = new IndexSearcher(reader);
        var parser = new QueryParser(LuceneVersion.LUCENE_48, "input", analyzer);
        var query = parser.Parse(keyword);
        var result = searcher.Search(query, null, 10);

        if (result.ScoreDocs.Length == 0)
        {
            return 0;
        }

        var doc = result.ScoreDocs.Single();
        return doc.Score;
    }
}
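If you specifically want to avoid building even a RAMDirectory-backed index, Java Lucene also ships a MemoryIndex class (in the memory module) intended for scoring a single in-memory document against a query. A minimal sketch in Java, assuming Lucene 4.8 to match the version above; note the returned value is Lucene's raw relevance score, not a normalized 0.0-1.0 figure:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class Relevance {

    // Analyzes "input" into a single-document in-memory index and scores it
    // against "keywords". Returns 0.0f when nothing matches.
    public static float relevanceScore(String keywords, String input) throws ParseException {
        Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_48);
        MemoryIndex index = new MemoryIndex();
        index.addField("input", input, analyzer);
        Query query = new QueryParser(Version.LUCENE_48, "input", analyzer).parse(keywords);
        return index.search(query);
    }
}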

Lucene.Net 4.8 add multiple filters to custom analyzer

I'm attempting to create a custom analyser with multiple filters applied.
The issue is that only the last filter (LowerCaseFilter) is applied.
public class CustomAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = new KeywordTokenizer(reader);

        // Remove basic stop words: a, an, the, in, on etc.
        TokenStream result = new StopFilter(GlobalVariables.LuceneVersion, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

        // Remove tile/tiles
        CharArraySet stopWords = new CharArraySet(GlobalVariables.LuceneVersion, 1, true)
        {
            "test",
        };
        result = new StopFilter(GlobalVariables.LuceneVersion, tokenizer, stopWords);

        // Make case insensitive
        result = new LowerCaseFilter(GlobalVariables.LuceneVersion, tokenizer);

        return new TokenStreamComponents(tokenizer, result);
    }
}
Don't pass the tokenizer into each filter, pass the previous filter in.
Tokenizer tokenizer = new KeywordTokenizer(reader);

TokenStream result = new StopFilter(GlobalVariables.LuceneVersion, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

CharArraySet stopWords = new CharArraySet(GlobalVariables.LuceneVersion, 1, true)
{
    "test",
};
result = new StopFilter(GlobalVariables.LuceneVersion, result, stopWords);
result = new LowerCaseFilter(GlobalVariables.LuceneVersion, result);

return new TokenStreamComponents(tokenizer, result);

Lucene Full Text search Engine Did you mean feature

How do I implement a "Did You Mean" / spell-checker feature in the Lucene full-text search engine?
After you've created the main index, you can create the spell-checker index from a dictionary built on it using:
public void createSpellChekerIndex() throws CorruptIndexException, IOException {
    final IndexReader reader = IndexReader.open(this.indexDirectory, true);
    final Dictionary dictionary = new LuceneDictionary(reader, LuceneExample.FIELD);
    final SpellChecker spellChecker = new SpellChecker(this.spellDirectory);
    final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    final IndexWriterConfig writerConfig = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    spellChecker.indexDictionary(dictionary, writerConfig, true);
    spellChecker.close();
}
and then ask for an array of suggestions with:
public String[] getSuggestions(final String queryString, final int numberOfSuggestions, final float accuracy) {
    try {
        final SpellChecker spellChecker = new SpellChecker(this.spellDirectory);
        final String[] similarWords = spellChecker.suggestSimilar(queryString, numberOfSuggestions, accuracy);
        return similarWords;
    } catch (final Exception e) {
        return new String[0];
    }
}
Example:
After indexing the following documents:
luceneExample.index("spell checker");
luceneExample.index("did you mean");
luceneExample.index("hello, this is a test");
luceneExample.index("Lucene is great");
and creating the spell index with the method above, I tried to search for the string "lucete", asking for suggestions with:
final String query = "lucete";
final String[] suggestions = luceneExample.getSuggestions(query, 5, 0.2f);
System.out.println("Did you mean:\n" + Arrays.toString(suggestions));
This was the output:
Did you mean:
[lucene]
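A common way to wire this into the search flow is to ask for suggestions only when the original query returns no hits. A small sketch (my own addition, imports omitted to match the snippets above), assuming Lucene 3.6 and the getSuggestions method from earlier:

// Runs the query and falls back to "did you mean" suggestions when it finds nothing.
public void searchWithDidYouMean(IndexSearcher searcher, QueryParser parser, String queryString) throws Exception {
    final Query query = parser.parse(queryString);
    final TopDocs results = searcher.search(query, 10);
    if (results.totalHits == 0) {
        final String[] suggestions = getSuggestions(queryString, 5, 0.2f);
        System.out.println("Did you mean:\n" + Arrays.toString(suggestions));
    }
}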