Noise Removal using Lucene 4.8

I couldn't find any examples of removing stop words from text using Lucene 4.8. Can you please advise me on how to use the StopFilter and StopAnalyzer classes to achieve this?

Two of the three StandardAnalyzer constructors allow specifying stop words; just use one of those. This analyzer uses StopFilter underneath, so you don't have to do anything extra.
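A minimal sketch of what that looks like (the field name and stop-word list here are invented for illustration; assumes the lucene-core and lucene-analyzers-common 4.8 jars are on the classpath):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Arrays;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    public class StopWordDemo {
        public static void main(String[] args) throws IOException {
            // Custom stop-word set; StandardAnalyzer applies a StopFilter internally.
            CharArraySet stopWords = new CharArraySet(
                    Version.LUCENE_48, Arrays.asList("the", "a", "an"), true);
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48, stopWords);

            // Analyze some text and print the surviving tokens.
            TokenStream ts = analyzer.tokenStream("body",
                    new StringReader("The quick brown fox"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints: quick, brown, fox
            }
            ts.end();
            ts.close();
        }
    }

If you want the default English stop words instead of a custom list, the single-argument constructor new StandardAnalyzer(Version.LUCENE_48) already uses StandardAnalyzer.STOP_WORDS_SET.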

Related

How to set custom name, suffix for mapper files and interfaces in mybatis generator?

Can you set a custom suffix and naming rule for mapper XML files and interfaces in MyBatis Generator (MBG)?
For example, when generating mapper files for the class Book, MBG generates the mapper file BookMapper.xml and the interface BookMapper.java. However, I wish to change the suffix to something else, like BookMapperBase.xml or BookDaoBase.xml, and BookMapperBase.java or BookDaoBase.java.
The reason is that former colleagues were using BookMapper.xml for their hand-written SQL statements, and reusing the same name would cause confusion. Moreover, I do not wish to use the generated mappers directly, but to use custom mapper files that extend BookMapperBase.xml.
I have searched online and found some GitHub projects and the Hot Rod ORM, but is this really not supported by the official MyBatis Generator? If not, what is your recommended alternative?
There are a couple of options.
You could use a domain object renaming rule as documented here: http://www.mybatis.org/generator/configreference/domainObjectRenamingRule.html
If that doesn't work the way you want it to, you could write a MyBatis Generator plugin to change the names of the generated artifacts. There is an example here: https://github.com/mybatis/generator/blob/master/core/mybatis-generator-core/src/main/java/org/mybatis/generator/plugins/RenameExampleClassPlugin.java
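For the plugin route, here is a minimal sketch of the idea (the getter/setter names are taken from MBG's IntrospectedTable API but may differ across versions, so treat this as a starting point rather than a drop-in solution):

    import java.util.List;

    import org.mybatis.generator.api.IntrospectedTable;
    import org.mybatis.generator.api.PluginAdapter;

    public class RenameMapperPlugin extends PluginAdapter {

        @Override
        public boolean validate(List<String> warnings) {
            return true; // no configuration properties to check in this sketch
        }

        @Override
        public void initialized(IntrospectedTable introspectedTable) {
            // BookMapper -> BookMapperBase (generated Java interface)
            String javaMapper = introspectedTable.getMyBatis3JavaMapperType();
            introspectedTable.setMyBatis3JavaMapperType(javaMapper + "Base");

            // BookMapper.xml -> BookMapperBase.xml (generated XML file)
            String xmlName = introspectedTable.getMyBatis3XmlMapperFileName();
            introspectedTable.setMyBatis3XmlMapperFileName(
                    xmlName.replace(".xml", "Base.xml"));
        }
    }

You would then register the plugin in generatorConfig.xml with a plugin element whose type attribute is the fully qualified class name above.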

Are specific Lucene classes intended to be consumed by applications?

I'm new to the Apache Lucene library. I'd like to directly consume a class in this library called LevenshteinDistance to calculate similarity between strings. Would it be correct for my own application to consume it directly, or should I go through the Lucene API?
Just using that single class is totally OK, but if that's all you need, you should take the source code of that class, remove the unneeded Lucene dependencies, and use it directly. Lucene is a huge library, and you don't want it in your project if you only need to compute a string distance.
One thing: in the source code for LevenshteinDistance.java there's a comment mentioning that the code was taken from the Apache Commons StringUtils class. Maybe you should just use that dependency instead. It's here: https://commons.apache.org/proper/commons-lang/
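If you go the Commons route, a minimal sketch (assumes commons-lang3 on the classpath; note that in newer Commons releases this method is deprecated in favor of the separate commons-text project):

    import org.apache.commons.lang3.StringUtils;

    public class DistanceDemo {
        public static void main(String[] args) {
            // Classic edit distance: the number of single-character edits
            // needed to turn one string into the other.
            int d = StringUtils.getLevenshteinDistance("kitten", "sitting");
            System.out.println(d); // prints: 3
        }
    }

Be aware that Lucene's LevenshteinDistance returns a normalized similarity score between 0 and 1, whereas this returns the raw edit count, so pick whichever matches what your application actually needs.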

Removing numerous annotations automatically in IntelliJ

I have a package that contains lots of classes, each having lots of annotations. Is there any way to delete all the annotations automatically, rather than manually deleting them one by one?
I'm using the latest version of IntelliJ IDEA. I cannot use search and replace, because there are a lot of different annotations.
IntelliJ has a feature called Structural Search and Replace. You could use it to find all annotations and replace them with nothing. I have never really used this feature myself, so I can't offer you the exact search you need. The best I can do is link to the documentation for the feature; I am sure it can do what you want, though:
http://www.jetbrains.com/idea/webhelp/structural-search-and-replace.html
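As an untested starting point (the $variable$ placeholder syntax is standard in Structural Search, but verify the exact template against your IDEA version): create a search template along the lines of

    @$Annotation$

in the Java context, leave the replacement template empty, and run the replace over your package. You can constrain the $Annotation$ variable with a regular expression if you want to keep some annotations and delete the rest.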

How do I handle special characters like(#) in OpenSearchServer / Lucene?

I am using OpenSearchServer (community edition) v1.2.4-rc3 - stable - rev 1474 - build 802. I crawl a C# and C++ programming website. Now when I search for C# or C++, the software strips special characters like # and +, so the results it returns are not exact. How do I handle special characters like # in OpenSearchServer / Lucene? Can anyone please suggest an approach? Thanks in advance.
You need to change your indexing strategy to use a custom or semi-custom tokenizer that preserves the special characters you need to represent C# and C++ code terms. You would use this tokenizer both during indexing and during searching.
Off-hand, I would look at org.apache.lucene.analysis.standard and org.apache.lucene.wikipedia.analysis to get some ideas as to how to construct the tokenizer (using a tokenizer, i.e. lexical analyzer, generator like JFlex may be called for rather than hand-coding the tokenizer).
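For the hand-coded approach, a minimal sketch against the Lucene 4.x API (the class name is invented, and OpenSearchServer may bundle a different Lucene version, so adjust the base class and imports accordingly):

    import java.io.Reader;

    import org.apache.lucene.analysis.util.CharTokenizer;
    import org.apache.lucene.util.Version;

    // A tokenizer that treats '#' and '+' as token characters, so terms
    // like "C#" and "C++" survive analysis instead of being stripped.
    public class CodeTokenizer extends CharTokenizer {

        public CodeTokenizer(Version matchVersion, Reader input) {
            super(matchVersion, input);
        }

        @Override
        protected boolean isTokenChar(int c) {
            // Keep letters, digits, and the symbols used in language names.
            return Character.isLetterOrDigit(c) || c == '#' || c == '+';
        }
    }

Remember to plug the same tokenizer into both the indexing and the query analysis chain; if only one side preserves the symbols, queries for "C#" still won't match.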

missing packages in lucene 4.0 snapshot

Does anybody know why there is no QueryParser, nor IndexWriter.MaxFieldLength(25000), among other things, in the Lucene 4.0 snapshot?
I'm having a hard time porting my code to this newer version, even though I'm following the code as given here: http://search-lucene.com/jd/lucene/overview-summary.html
How do I find the missing packages, and how do I get them? The snapshot jar doesn't seem to contain all the features.
Thanks.
Lucene has been re-architected, and some classes which used to be in the core module are now in submodules. You will now find the QueryParser stuff in the queryparser submodule. Similarly, lots of useful analyzers, tokenizers and token filters have been moved to the analysis submodule.
Regarding IndexWriter, the maximum field length option has been deprecated; it is now recommended to wrap an analyzer with LimitTokenCountAnalyzer (in the analysis submodule) instead.
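A minimal sketch of that replacement (assumes the lucene-core and lucene-analyzers-common 4.x jars; the 25000 limit mirrors the old MaxFieldLength value from the question):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    public class LimitDemo {
        public static void main(String[] args) {
            Analyzer base = new StandardAnalyzer(Version.LUCENE_40);
            // Keep only the first 25000 tokens of each field, which is what
            // IndexWriter.MaxFieldLength(25000) used to do.
            Analyzer limited = new LimitTokenCountAnalyzer(base, 25000);
            IndexWriterConfig config =
                    new IndexWriterConfig(Version.LUCENE_40, limited);
            // ... pass config to new IndexWriter(directory, config)
        }
    }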