How to add options for Analyze in Apache Lucene? - lucene

Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include converting tokens to lowercase, stemming, removing stopwords, etc.
I'm running an experiment where I want to try all possible combinations of analysis operations: stemming only, stopping only, stemming and stopping, ...
In total, there 36 combinations that I want to try.
How can I do easily and gracefully do this?
I know that I can extend the Analyzer class and implement the tokenStream() function to create my own Analyzer:
public class MyAnalyzer extends Analyzer
{
public TokenStream tokenStream(String field, final Reader reader){
return new NameFilter(
CaseNumberFilter(
new StopFilter(
new LowerCaseFilter(
new StandardFilter(
new StandardTokenizer(reader)
)
), StopAnalyzer.ENGLISH_STOP_WORDS)
)
);
}
What I'd like to do is write one such class, which can somehow take boolean values for each of the possible operations (doStopping, doStemming, etc.). I don't want to have to write 36 different Analyzer classes that each perform one of the 36 combinations. What makes it difficult is the way the filters are all combined together in their constructors.
Any ideas on how to do this gracefully?
EDIT: By "gracefully", I mean that I can easily create a new Analyzer in some sort of loop:
analyzer = new MyAnalyzer(doStemming, doStopping, ...)
where doStemming and doStopping change with each loop iteration.

Solr solves this problem by using Tokenizer and TokenFilter factories. You could do the same, for example:
public interface TokenizerFactory {
Tokenizer newTokenizer(Reader reader);
}
public interface TokenFilterFactory {
TokenFilter newTokenFilter(TokenStream source);
}
public class ConfigurableAnalyzer {
private final TokenizerFactory tokenizerFactory;
private final List<TokenFilterFactory> tokenFilterFactories;
public ConfigurableAnalyzer(TokenizerFactory tokenizerFactory, TokenFilterFactory... tokenFilterFactories) {
this.tokenizerFactory = tokenizerFactory;
this.tokenFilterFactories = Arrays.asList(tokenFilterFactories);
}
public TokenStream tokenStream(String field, Reader source) {
TokenStream sink = tokenizerFactory.newTokenizer(source);
for (TokenFilterFactory tokenFilterFactory : tokenFilterFactories) {
sink = tokenFilterFactory.newTokenFilter(sink);
}
return sink;
}
}
This way, you can configure your analyzer by passing a factory for one tokenizer and 0 to n filters as constructor arguments.

Add some class variables to the custom Analyzer class which can be easily set and unset on the fly. Then, in the tokenStream() function, use those variables to determine which filters to perform.
public class MyAnalyzer extends Analyzer {
private Set customStopSet;
public static final String[] STOP_WORDS = ...;
private boolean doStemming = false;
private boolean doStopping = false;
public JavaSourceCodeAnalyzer(){
super();
customStopSet = StopFilter.makeStopSet(STOP_WORDS);
}
public void setDoStemming(boolean val){
this.doStemming = val;
}
public void setDoStopping(boolean val){
this.doStopping = val;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
// First, convert to lower case
TokenStream out = new LowerCaseTokenizer(reader);
if (this.doStopping){
out = new StopFilter(true, out, customStopSet);
}
if (this.doStemming){
out = new PorterStemFilter(out);
}
return out;
}
}
There is one gotcha: LowerCaseTokenizer takes as input the reader variable, and returns a TokenStream. This is fine for the following filters (StopFilter, PorterStemFilter), because they take TokenStreams as input and return them as output, and so we can chain them together nicely. However, this means you can't have a filter before the LowerCaseTokenizer that returns a TokenStream. In my case, I wanted to split camelCase words into parts, and this has to be done before converting to lower case. My solution was to perform the splitting manually in the custom Indexer class, so by the time MyAnalyzer sees the text, it has already been split.
(I have also added a boolean flag to my customer Indexer class, so now both can work based solely on flags.)
Is there a better answer?

Related

Java 8 map with Map.get nullPointer Optimization

public class StartObject{
private Something something;
private Set<ObjectThatMatters> objectThatMattersSet;
}
public class Something{
private Set<SomeObject> someObjecSet;
}
public class SomeObject {
private AnotherObject anotherObjectSet;
}
public class AnotherObject{
private Set<ObjectThatMatters> objectThatMattersSet;
}
public class ObjectThatMatters{
private Long id;
}
private void someMethod(StartObject startObject) {
Map<Long, ObjectThatMatters> objectThatMattersMap = StartObject.getSomething()
.getSomeObject.stream()
.map(getSomeObject::getAnotherObject)
.flatMap(anotherObject-> anotherObject.getObjectThatMattersSet().stream())
.collect(Collectors.toMap(ObjectThatMatters -> ObjectThatMatters.getId(), Function.identity()));
Set<ObjectThatMatters > dbObjectThatMatters = new HashSet<>();
try {
dbObjectThatMatters.addAll( tartObject.getObjectThatMatters().stream().map(objectThatMatters-> objectThatMattersMap .get(objectThatMatters.getId())).collect(Collectors.toSet()));
} catch (NullPointerException e) {
throw new someCustomException();
}
startObject.setObjectThatMattersSet(dbObjectThatMatters);
Given a StartObject that contains a set of ObjectThatMatters
And a Something that contains the database structure already fetched filled with all valid ObjectThatMatters.
When I want to swap the StartObject set of ObjectThatMatters to the valid corresponding db objects that only exist in the scope of the Something
Then I compare the set of ObjectThatMatters on the StartObject
And replace every one of them with the valid ObjectThatMatters inside the Something object
And If some ObjectThatMatters doesn't have a valid ObjectThatMatters I throw a someCustomException
This someMethod seems pretty horrible, how can I make it more readable?
Already tried to change the try Catch to a optional but that doesn't actually help.
Used a Map instead of a List with List.contains because of performance, was this a good idea? The total number of ObjectThatMatters will be usually 500.
I'm not allowed to change the other classes structure and I'm only showing you the fields that affect this method not every field since they are extremely rich objects.
You don’t need a mapping step at all. The first operation, which produces a Map, can be used to produce the desired Set in the first place. Since there might be more objects than you are interested in, you may perform a filter operation.
So first, collect the IDs of the desired objects into a set, then collect the corresponding db objects, filtering by the Set of IDs. You can verify whether all IDs have been found, by comparing the resulting Set’s size with the ID Set’s size.
private void someMethod(StartObject startObject) {
Set<Long> id = startObject.getObjectThatMatters().stream()
.map(ObjectThatMatters::getId).collect(Collectors.toSet());
HashSet<ObjectThatMatters> objectThatMattersSet =
startObject.getSomething().getSomeObject().stream()
.flatMap(so -> so.getAnotherObject().getObjectThatMattersSet().stream())
.filter(obj -> id.contains(obj.getId()))
.collect(Collectors.toCollection(HashSet::new));
if(objectThatMattersSet.size() != id.size())
throw new SomeCustomException();
startObject.setObjectThatMattersSet(objectThatMattersSet);
}
This code produces a HashSet; if this is not a requirement, you can just use Collectors.toSet() to get an arbitrary Set implementation.
It’s even easy to find out which IDs were missing:
private void someMethod(StartObject startObject) {
Set<Long> id = startObject.getObjectThatMatters().stream()
.map(ObjectThatMatters::getId)
.collect(Collectors.toCollection(HashSet::new));// ensure mutable Set
HashSet<ObjectThatMatters> objectThatMattersSet =
startObject.getSomething().getSomeObject().stream()
.flatMap(so -> so.getAnotherObject().getObjectThatMattersSet().stream())
.filter(obj -> id.contains(obj.getId()))
.collect(Collectors.toCollection(HashSet::new));
if(objectThatMattersSet.size() != id.size()) {
objectThatMattersSet.stream().map(ObjectThatMatters::getId).forEach(id::remove);
throw new SomeCustomException("The following IDs were not found: "+id);
}
startObject.setObjectThatMattersSet(objectThatMattersSet);
}

Does this saving/loading pattern have a name?

There's a variable persistence concept I have integrated multiple times:
// Standard initialiation
boolean save = true;
Map<String, Object> dataHolder;
// variables to persist
int number = 10;
String text = "I'm saved";
// Use the variables in various ways in the project
void useVariables() { ... number ... text ...}
// Function to save the variables into a datastructure and for example write them to a file
public Map<String, Object> getVariables()
{
Map<String, Object> data = new LinkedHashMap<String, Object>();
persist(data);
return(data);
}
// Function to load the variables from the datastructure
public void setVariables(Map<String, Object> data)
{
persist(data);
}
void persist(Map<String, Object> data)
{
// If the given datastructure is empty, it means data should be saved
save = (data.isEmpty());
dataHolder = data;
number = handleVariable("theNumber", number);
text = handleVariable("theText", text);
...
}
private Object handleVariable(String name, Object value)
{
// If currently saving
if(save)
dataHolder.put(name, value); // Just add to the datastructure
else // If currently writing
return(dataHolder.get(name)); // Read and return from the datastruct
return(value); // Return the given variable (no change)
}
The main benefit of this principle is that you only have a single script where you have to mention new variables you add during the development and it's one simple line per variable.
Of course you can move the handleVariable() function to a different class which also contains the "save" and "dataHolder" variables so they wont be in the main application.
Additionally you could pass meta-information, etc. for each variable required for persisting the datastructure to a file or similar by saving a custom class which contains this information plus the variable instead of the object itself.
Performance could be improved by keeping track of the order (in another datastructure when first time running through the persist() function) and using a "dataHolder" based on an array instead of a search-based map (-> use an index instead of a name-string).
However, for the first time, I have to document this and so I wondered whether this function-reuse principle has a name.
Does someone recognize this idea?
Thank you very much!

HBase secondary index with observer coprocessor, .put on the index table results in recursion

In HBase database I want to create a secondary index by using additional "linking" table. I have followed the example given in this answer: Create secondary index using coprocesor HBase
I am not very well familiar with the entire concept of HBase, and I had read some examples on the issue of creating secondary indexes. I am attaching the coprocessor to single table only, like this:
disable 'Entry2'
alter 'Entry2', METHOD => 'table_att', 'COPROCESSOR' => '/home/user/hbase/rootdir/hcoprocessors.jar|com.acme.hobservers.EntryParentIndex||'
enable 'Entry2'
The source code of it, is as follows:
public class EntryParentIndex extends BaseRegionObserver{
private static final Log LOG = LogFactory.getLog(CoprocessorHost.class);
private HTablePool pool = null;
private final static String INDEX_TABLE = "EntryParentIndex";
private final static String SOURCE_TABLE = "Entry2";
#Override
public void start(CoprocessorEnvironment env) throws IOException {
pool = new HTablePool(env.getConfiguration(), 10);
}
#Override
public void prePut(
final ObserverContext<RegionCoprocessorEnvironment> observerContext,
final Put put,
final WALEdit edit,
final boolean writeToWAL)
throws IOException {
try {
final List<KeyValue> filteredList = put.get(Bytes.toBytes ("data"),Bytes.toBytes("parentId"));
byte [] id = put.getRow(); //Get the Entry ID
KeyValue kv=filteredList.get( 0 ); //get Entry PARENT_ID
byte[] parentId=kv.getValue();
HTableInterface htbl = pool.getTable(Bytes.toBytes(INDEX_TABLE));
//create row key for the index table
byte[] p1=concatTwoByteArrays(parentId, ":".getBytes()); //Insert a semicolon between two UUIDs
byte[] rowkey=concatTwoByteArrays(p1, id);
Put indexput = new Put(rowkey);
//The following call is setting up a strange? recursion, resulting
//...in thesame prePut method invoken again and again. Interestingly
//...the recursion limits itself up to 6 times. The actual row does not
//...get inserted into the INDEX_TABLE
htbl.put(indexput);
htbl.close();
}
catch ( IllegalArgumentException ex) { }
}
#Override
public void stop(CoprocessorEnvironment env) throws IOException {
pool.close();
}
public static final byte[] concatTwoByteArrays(byte[] first, byte[] second) {
byte[] result = Arrays.copyOf(first, first.length + second.length);
System.arraycopy(second, 0, result, first.length, second.length);
return result;
}
}
This executes when I perform put on the SOURCE_TABLE.
There is a comment in the code (please seek it): "The following call is setting up a strange".
I set a debugging print in the log confirming that the prePut method is being executed only on the SOURCE_TABLE, and never on the INDEX_TABLE. Yet I don't understand why this strange recursion is happening despite in the coprocessor I only execute one put on the INDEX_TABLE.
I have also confirmed that the put action on the source table is again only one.
I have fixed my problem. It came out to be that I was adding multiple times the same observer mistakenly thinking that it is getting lost after Hbase restart.
Also the reason why the .put call to the INDEX_TABLE was not working is because I did not set a value to it, but only a rowkey, mistakenly thinking that this is possible. Yet HBase did not throw any excepiton whatsoever, just did not perform the PUT, no info given, which may be confusing for newcommers to this technology.

Lucene 4.1 : How split words that contains "dots" when indexing?

I'l trying to figure out what I should do to index my keywords that contains "." .
ex : this.name
I want to index the terms : this and name in my index.
I use the StandardAnalyser. I try to extends the WhitespaceTokensizer or extends TokenFilter, but I'm not sure if I'm in the right direction.
if I use the StandardAnalyser, I'll obtain "this.name" as a keyword, and that's not what I want, but the analyser do the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;
import java.io.IOException;
import java.io.Reader;
public final class MyAnalyzer extends StopwordAnalyzerBase {
private int maxTokenLength = 255;
public MyAnalyzer() {
super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
}
#Override
protected TokenStreamComponents createComponents
(final String fieldName, final Reader reader) {
final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new StopFilter(matchVersion, tok, stopwords);
return new TokenStreamComponents(src, tok) {
#Override
protected void setReader(final Reader reader) throws IOException {
src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
super.setReader(reader);
}
};
}
#Override
protected Reader initReader(String fieldName, Reader reader) {
NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
builder.add(".", " ");
builder.add("_", " ");
NormalizeCharMap normMap = builder.build();
return new MappingCharFilter(normMap, reader);
}
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
public class TestMyAnalyzer extends BaseTokenStreamTestCase {
private Analyzer analyzer = new MyAnalyzer();
public void testPeriods() throws Exception {
BaseTokenStreamTestCase.assertAnalyzesTo
(analyzer,
"this.name; here.i.am; sentences ... end with periods.",
new String[] { "name", "here", "i", "am", "sentences", "end", "periods" } );
}
public void testUnderscores() throws Exception {
BaseTokenStreamTestCase.assertAnalyzesTo
(analyzer,
"some_underscore_term _and____ stuff that is_not in it",
new String[] { "some", "underscore", "term", "stuff" } );
}
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
you are getting caught by behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces some more complex to parsing rules than you may not be looking for. This one, in particular, is intended to prevent tokenization of URLs, IPs, idenifiers, etc. A simpler implementation might suit your needs, like LetterTokenizer.
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
Sebastien Dionne: I didn't understand how to split a word, do I have to parse the document char by char ?
Sebastien Dionne: I still want to know how to split a token into multiple part, and index them all
You may have to write a custom analyzer.
Analyzer is a combination of Tokenizer and possibly a chain of TokenFilter instances.
Tokenizer : Takes in the input text passed by you probably as a java.io.Reader. It
JUST breakdowns the text. Doesn't alter, just breaks it down.
TokenFilter : Takes in the token emitted by Tokenizer, adds / removes / alters tokens and emits the same one by one until all are finished.
If it replaces a token with multiple tokens based on requirements, buffers all, emits them one by one to the Indexer.
You may check following resource, unfortunately, you may have to sign-up for a trial membership.
By writing a custom analyzer, you can breakdown the text the way you want to. You may even use some existing components like LowercaseFilter. Fortunately, it is achievable with Lucene to come up with some Analyzer that serves your purpose if you couldn't find that as a built-in or on the web.
" Writing Custom Filters: Lucene in Action 2"

NHibernate Validator: Using Attributes vs. Using ValidationDefs

I've been using NH Validator for some time, mostly through ValidationDefs, but I'm still not sure about two things:
Is there any special benefit of using ValidationDef for simple/standard validations (like NotNull, MaxLength etc)?
I'm worried about the fact that those two methods throw different kinds of exceptions on validation, for example:
ValidationDef's Define.NotNullable() throws PropertyValueException
When using [NotNull] attribute, an InvalidStateException is thrown.
This makes me think mixing these two approaches isn't a good idea - it will be very difficult to handle validation exceptions consistently. Any suggestions/recommendations?
ValidationDef is probably more suitable for business-rules validation even if, having said that, I used it even for simple validation. There's more here.
What I like about ValidationDef is the fact that it has got a fluent interface.
I've been playing around with this engine for quite a while and I've put together something that works quite well for me.
I've defined an interface:
public interface IValidationEngine
{
bool IsValid(Entity entity);
IList<Validation.IBrokenRule> Validate(Entity entity);
}
Which is implemented in my validation engine:
public class ValidationEngine : Validation.IValidationEngine
{
private NHibernate.Validator.Engine.ValidatorEngine _Validator;
public ValidationEngine()
{
var vtor = new NHibernate.Validator.Engine.ValidatorEngine();
var configuration = new FluentConfiguration();
configuration
.SetDefaultValidatorMode(ValidatorMode.UseExternal)
.Register<Data.NH.Validation.User, Domain.User>()
.Register<Data.NH.Validation.Company, Domain.Company>()
.Register<Data.NH.Validation.PlanType, Domain.PlanType>();
vtor.Configure(configuration);
this._Validator = vtor;
}
public bool IsValid(DomainModel.Entity entity)
{
return (this._Validator.IsValid(entity));
}
public IList<Validation.IBrokenRule> Validate(DomainModel.Entity entity)
{
var Values = new List<Validation.IBrokenRule>();
NHibernate.Validator.Engine.InvalidValue[] values = this._Validator.Validate(entity);
if (values.Length > 0)
{
foreach (var value in values)
{
Values.Add(
new Validation.BrokenRule()
{
// Entity = value.Entity as BpReminders.Data.DomainModel.Entity,
// EntityType = value.EntityType,
EntityTypeName = value.EntityType.Name,
Message = value.Message,
PropertyName = value.PropertyName,
PropertyPath = value.PropertyPath,
// RootEntity = value.RootEntity as DomainModel.Entity,
Value = value.Value
});
}
}
return (Values);
}
}
I plug all my domain rules in there.
I bootstrap the engine at the app startup:
For<Validation.IValidationEngine>()
.Singleton()
.Use<Validation.ValidationEngine>();
Now, when I need to validate my entities before save, I just use the engine:
if (!this._ValidationEngine.IsValid(User))
{
BrokenRules = this._ValidationEngine.Validate(User);
}
and return, eventually, the collection of broken rules.