How to populate a Lucene 5.3 index

I have written the following class to populate a Lucene Index. I want to build an Index for Lucene so that I can query for specific documents. Unfortunately my documents are not added to the index.
Here is my code:
public class LuceneIndexer {
private IndexWriter indexWriter;
private IndexReader indexReader;
public LuceneIndexer() throws Exception {
Directory indexDir = FSDirectory.open(Paths.get("./index-directory"));
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setCommitOnClose(true);
config.setOpenMode(OpenMode.CREATE);
this.indexWriter = new IndexWriter(indexDir, config);
indexReader = DirectoryReader.open(this.indexWriter, true);
}
public void indexRelation(String subject, String description, String object) throws IOException {
System.out.println("Indexing relation between: " + subject+" and "+object);
Document doc = new Document();
doc.add(new TextField("subject", subject, Field.Store.YES));
doc.add(new TextField("description", description, Field.Store.YES));
doc.add(new TextField("object", object, Field.Store.YES));
indexWriter.addDocument(doc);
}
public void commit() throws Exception {
indexWriter.commit();
}
public int getNumberOfRelations() {
return indexReader.numDocs();
}
}
I am trying to get the following test case to pass:
public class LuceneIndexerTest {
private LuceneIndexer instance;
@Before
public void setUp() throws SQLException, IOException {
instance = new LuceneIndexer();
instance.indexRelation("subject1","descr1","object1");
instance.indexRelation("subject2","descr2","object2");
instance.indexRelation("subject3","descr3","object3");
instance.commit();
}
@After
public void tearDown() throws IOException {
instance.close();
}
@Test
public void testIndexing() {
Assert.assertEquals(3, instance.getNumberOfRelations());
Assert.assertEquals(3, instance.getNumberOfRelations("subject"));
}
}
Unfortunately, the test case reports that there are 0 documents in the index.

From Lucene's javadoc: "Any changes made to the index via IndexWriter will not be visible until a new IndexReader is opened".
The indexReader keeps a view of your index as it was when the IndexReader object was created. Just open a new one after each commit, and your indexReader will behave as expected.
Here is the fix for your LuceneIndexer class:
public void commit() throws Exception {
indexWriter.commit();
if (indexReader != null)
indexReader.close();
indexReader = DirectoryReader.open(this.indexWriter, true);
}

Related

Add weights to documents while indexing (Lucene 8 + Solr 8)

I am migrating Solr from 5.4.3 to 8.11 for one of my search apps and have successfully upgraded to 7.7.3. With the further upgrade, however, the order of the response data has changed from what it was earlier. I am trying to use FunctionScoreQuery along with a DoubleValuesSource, since CustomScoreQuery is deprecated in 7.7.3 and removed in 8.
Below is my code snippet (I am now using Solr 8.5.2 and Lucene 8.5.2).
public class CustomQueryParser extends QParserPlugin {
@Override
public QParser createParser(final String qstr, final SolrParams localParams, final SolrParams params,
final SolrQueryRequest req) {
return new MyParser(qstr, localParams, params, req);
}
private static class MyParser extends QParser {
private Query innerQuery;
private String queryString;
public MyParser(final String qstr, final SolrParams localParams, final SolrParams params,
final SolrQueryRequest req) {
super(qstr, localParams, params, req);
if (qstr == null || qstr.trim().length() == 0) {
this.queryString = DEFAULT_SEARCH_QUERY;
setString(this.queryString);
} else {
this.queryString = qstr;
}
try {
if (queryString.contains(":")) {
final QParser parser = getParser(queryString, "edismax", getReq());
this.innerQuery = parser.parse();
} else {
final QParser parser = getParser(queryString, "dismax", getReq());
this.innerQuery = parser.parse();
}
} catch (final SyntaxError ex) {
throw new RuntimeException("Error parsing query", ex);
}
}
@Override
public Query parse() throws SyntaxError{
final Query query = new MyCustomQuery(innerQuery);
final CustomValuesSource customValuesSource = new CustomValuesSource(queryString,innerQuery);
final FunctionScoreQuery fsq = FunctionScoreQuery.boostByValue(query, customValuesSource.fromFloatField("score"));
return fsq;
}
}
}
public class MyCustomQuery extends Query {
@Override
public Weight createWeight(final IndexSearcher searcher, final ScoreMode scoreMode, final float boost) throws IOException {
Weight weight;
if(query == null){
weight = new ConstantScoreWeight(this, boost) {
@Override
public Scorer scorer(final LeafReaderContext context) throws IOException {
return new ConstantScoreScorer(this,score(),scoreMode, DocIdSetIterator.all(context.reader().maxDoc()));
}
@Override
public boolean isCacheable(final LeafReaderContext leafReaderContext) {
return false;
}
};
}else {
weight = searcher.createWeight(query,scoreMode,boost);
}
return weight;
}
}
public class CustomValuesSource extends DoubleValuesSource {
@Override
public DoubleValues getValues(final LeafReaderContext leafReaderContext,final DoubleValues doubleValues) throws IOException {
final DoubleValues dv = new CustomDoubleValues(leafReaderContext);
return dv;
}
class CustomDoubleValues extends DoubleValues {
@Override
public boolean advanceExact(final int doc) throws IOException {
final Document document = leafReaderContext.reader().document(doc);
final List<IndexableField> fields = document.getFields();
for (final IndexableField field : fields) {
// total_score is being calculated with my own preferences
document.add(new FloatDocValuesField("score",total_score));
// can we include the score here?
// this custom logic, which includes the score, is never even called
}
}
}
I have been trying for a long time but have not found a single working example. Can anybody help me here?
Thank you,
Syamala.

NiFi custom processor stops publishing data to the next processor after some time

I have written a Redis data enricher that fetches the rules and the timeout key from Redis based on the macid. It works for some time, but then it stops sending flow files to the next processor (it stays in the running state but no longer sends flow files). NiFi is running in cluster mode. Is there anything wrong with the processor below (RedisDataEnricher)?
In the code below I open the Redis connection only once and then reuse the same connection for fetching data from Redis.
public class RedisDataEnricher extends AbstractProcessor {
private volatile Jedis jedis;
public static final PropertyDescriptor ConnectionHost = new PropertyDescriptor
.Builder().name("ConnectionHost")
.displayName("ConnectionHost")
.description("ConnectionHost")
.required(true)
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.build();
public static final PropertyDescriptor ConnectionPort = new PropertyDescriptor
.Builder().name("ConnectionPort")
.displayName("ConnectionPort")
.description("ConnectionPort")
.required(true)
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.build();
public static final PropertyDescriptor JSONKEY = new PropertyDescriptor
.Builder().name("JSONKEY")
.displayName("JSONKEY")
.description("JSON key to be fetched from input")
.required(true)
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.build();
public static final PropertyDescriptor MACIDKEY = new PropertyDescriptor
.Builder().name("MACIDKEY")
.displayName("MACIDKEY")
.description("MACIDKEY to be fetched from input")
.required(true)
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.build();
public static final Relationship SUCCESS = new Relationship.Builder()
.name("SUCCESS")
.description("SUCCESS")
.build();
public static final Relationship FAILURE = new Relationship.Builder()
.name("FAILURE")
.description("FAILURE")
.build();
private List<PropertyDescriptor> descriptors;
private Set<Relationship> relationships;
@Override
protected void init(final ProcessorInitializationContext context) {
final List<PropertyDescriptor> descriptors = new ArrayList<PropertyDescriptor>();
descriptors.add(JSONKEY);
descriptors.add(MACIDKEY);
descriptors.add(ConnectionHost);
descriptors.add(ConnectionPort);
this.descriptors = Collections.unmodifiableList(descriptors);
final Set<Relationship> relationships = new HashSet<Relationship>();
relationships.add(SUCCESS);
relationships.add(FAILURE);
this.relationships = Collections.unmodifiableSet(relationships);
}
@Override
public Set<Relationship> getRelationships() {
return this.relationships;
}
@Override
public final List<PropertyDescriptor> getSupportedPropertyDescriptors() {
return descriptors;
}
@OnScheduled
public void onScheduled(final ProcessContext context) {
try {
jedis = new Jedis(context.getProperty("ConnectionHost").toString(),Integer.parseInt(context.getProperty("ConnectionPort").toString()));
} catch (Exception e) {
getLogger().error("Unable to establish Redis connection.");
}
}
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
FlowFile flowFile = session.get();
if ( flowFile == null ) {
return;
}
else{
try{
InputStream inputStream = session.read(flowFile);
StringWriter writer = new StringWriter();
IOUtils.copy(inputStream, writer, "UTF-8");
Jedis jedis1=jedis;
JSONObject json=new JSONObject(writer.toString());
inputStream.close();
JSONObject json1=new JSONObject();
String rules=jedis1.hget(json.getJSONObject(context.getProperty("JSONKEY").toString()).getString(context.getProperty("MACIDKEY").toString()), "rules");
json1.put("data", json.getJSONObject(context.getProperty("JSONKEY").toString()));
json1.put("timeOut", jedis1.hget(json.getJSONObject(context.getProperty("JSONKEY").toString()).getString(context.getProperty("MACIDKEY").toString()),"timeOut"));
json1.put("rules", rules!=null?new ArrayList<String>(Arrays.asList(rules.split(" , "))):new ArrayList<>());
flowFile = session.write(flowFile, new OutputStreamCallback() {
@Override
public void process(OutputStream out) throws IOException {
out.write(json1.toString().getBytes());
}
});
flowFile = session.putAttribute(flowFile, "OutBound", jedis1.hget(json.getJSONObject(context.getProperty("JSONKEY").toString()).getString(context.getProperty("MACIDKEY").toString()),"OutBound"));
session.transfer(flowFile, SUCCESS);
}
catch(Exception e)
{
session.transfer(flowFile, FAILURE);
}
}
}
}

How to achieve fault tolerance when Flink sinks data to HDFS with gzip compression?

We want to write compressed data to HDFS with Flink's BucketingSink or StreamingFileSink. I have written my own Writer, which works fine if no failure occurs. However, when it encounters a failure and restarts from a checkpoint, it generates a valid-length file (Hadoop < 2.7) or truncates the file. Unluckily, gzip files are binary files with a trailer at the end, so simple truncation does not work in my case. Any ideas on how to enable exactly-once semantics for a compressed HDFS sink?
That's my writer's code:
public class HdfsCompressStringWriter extends StreamWriterBaseV2<JSONObject> {
private static final long serialVersionUID = 2L;
/**
* The {@code CompressFSDataOutputStream} for the current part file.
*/
private transient GZIPOutputStream compressionOutputStream;
public HdfsCompressStringWriter() {}
@Override
public void open(FileSystem fs, Path path) throws IOException {
super.open(fs, path);
this.setSyncOnFlush(true);
compressionOutputStream = new GZIPOutputStream(this.getStream(), true);
}
public void close() throws IOException {
if (compressionOutputStream != null) {
compressionOutputStream.close();
compressionOutputStream = null;
}
resetStream();
}
@Override
public void write(JSONObject element) throws IOException {
if (element == null || !element.containsKey("body")) {
return;
}
String content = element.getString("body") + "\n";
compressionOutputStream.write(content.getBytes());
compressionOutputStream.flush();
}
@Override
public Writer<JSONObject> duplicate() {
return new HdfsCompressStringWriter();
}
}
I would recommend implementing a BulkWriter for the StreamingFileSink which compresses the elements via a GZIPOutputStream. The code could look like the following:
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.enableCheckpointing(1000);
final DataStream<Integer> input = env.addSource(new InfinitySource());
final StreamingFileSink<Integer> streamingFileSink = StreamingFileSink.<Integer>forBulkFormat(new Path("output"), new GzipBulkWriterFactory<>()).build();
input.addSink(streamingFileSink);
env.execute();
}
private static class GzipBulkWriterFactory<T> implements BulkWriter.Factory<T> {
@Override
public BulkWriter<T> create(FSDataOutputStream fsDataOutputStream) throws IOException {
final GZIPOutputStream gzipOutputStream = new GZIPOutputStream(fsDataOutputStream, true);
return new GzipBulkWriter<>(new ObjectOutputStream(gzipOutputStream), gzipOutputStream);
}
}
private static class GzipBulkWriter<T> implements BulkWriter<T> {
private final GZIPOutputStream gzipOutputStream;
private final ObjectOutputStream objectOutputStream;
public GzipBulkWriter(ObjectOutputStream objectOutputStream, GZIPOutputStream gzipOutputStream) {
this.gzipOutputStream = gzipOutputStream;
this.objectOutputStream = objectOutputStream;
}
@Override
public void addElement(T t) throws IOException {
objectOutputStream.writeObject(t);
}
@Override
public void flush() throws IOException {
objectOutputStream.flush();
}
@Override
public void finish() throws IOException {
objectOutputStream.flush();
gzipOutputStream.finish();
}
}
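Note that this sketch serializes elements with an ObjectOutputStream, so the files contain Java serialization data rather than plain text. If, as in the question, the records are JSON strings that should end up as gzipped text lines, a variant along these lines (a hypothetical, untested sketch; it assumes java.nio.charset.StandardCharsets is imported) could write each element as a UTF-8 line instead:
private static class GzipStringBulkWriter<T> implements BulkWriter<T> {
    private final GZIPOutputStream gzipOutputStream;

    GzipStringBulkWriter(GZIPOutputStream gzipOutputStream) {
        this.gzipOutputStream = gzipOutputStream;
    }

    @Override
    public void addElement(T element) throws IOException {
        // write the element's string form as one gzipped text line
        gzipOutputStream.write((element.toString() + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void flush() throws IOException {
        gzipOutputStream.flush();
    }

    @Override
    public void finish() throws IOException {
        // finish() writes the gzip trailer without closing the underlying part-file stream
        gzipOutputStream.finish();
    }
}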

Lucene query syntax for requiring both of the specified words

I am a beginner with Lucene.
I have a field named fstname in my documents.
How can I retrieve documents that contain both the words "vamshi" and "sai" in the fstname field?
public class Indexer
{
public Indexer() {}
private IndexWriter indexWriter = null;
public IndexWriter getIndexWriter(boolean create) throws IOException
{
if (indexWriter == null)
{
File file=new File("D:/index-directory");
Path dirPath = file.toPath();
Directory indexDir = FSDirectory.open(file);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2,new StandardAnalyzer());
indexWriter = new IndexWriter(indexDir, config);
}
return indexWriter;
}
public void closeIndexWriter() throws IOException
{
if (indexWriter != null)
{
indexWriter.close();
}
}
public void indexHotel(Hotel hotel) throws IOException
{
IndexWriter writer = getIndexWriter(false);
Document doc = new Document();
doc.add(new StringField("id", hotel.getId(), Field.Store.YES));
doc.add(new StringField("fstname", hotel.getFstname(), Field.Store.YES));
doc.add(new StringField("lastname", hotel.getLastname(), Field.Store.YES));
doc.add(new LongField("mobileno", hotel.getMobileno(), Field.Store.YES));
String fullSearchableText = hotel.getId()+" "+hotel.getFstname()+ " " + hotel.getLastname() + " " + hotel.getMobileno();
doc.add(new TextField("content", fullSearchableText, Field.Store.NO));
writer.addDocument(doc);
}
public void rebuildIndexes() throws IOException
{
getIndexWriter(true);
indexWriter.deleteAll();
Hotel[] hotels = HotelDatabase.getHotels();
for(Hotel hotel : hotels)
{
indexHotel(hotel);
}
closeIndexWriter();
}
}
public class SearchEngine
{
private IndexSearcher searcher = null;
private QueryParser parser = null;
/** Creates a new instance of SearchEngine */
public SearchEngine() throws IOException
{
File file=new File("D:/index-directory");
Path dirPath = file.toPath();
searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(file)));
parser = new QueryParser("content", new StandardAnalyzer());
}
public TopDocs performSearch(String queryString, int n)
throws IOException, ParseException
{
Query query = parser.parse(queryString);
return searcher.search(query, n);
}
public Document getDocument(int docId)
throws IOException {
return searcher.doc(docId);
}
}
public class HotelDatabase
{
private static final Hotel[] HOTELS = {
new Hotel("1","vamshi","chinta",9158191135L),
new Hotel("2","vamshi krishna","chinta",9158191136L),
new Hotel("3","krishna","chinta",9158191137L),
new Hotel("4","vamshi","something",9158191138L),
new Hotel("5","venky","abc",123456789L),
new Hotel("6","churukoti","def",123456789L),
new Hotel("7","chinta","vamshi",9158191139L),
new Hotel("8","chinta","krishna vamshi",9158191139L),
};
public static Hotel[] getHotels() {
return HOTELS;
}
public static Hotel getHotel(String id) {
for(Hotel hotel : HOTELS) {
if (id.equals(hotel.getId())) {
return hotel;
}
}
return null;
}
}
public class Hotel
{
private String fstname;
private String lastname;
private long mobileno;
private String id;
public void setMobileno(long mobileno) {
this.mobileno = mobileno;
}
public Hotel()
{
}
public Hotel(String id,
String fstname,
String lastname,
Long mobileno) {
this.id = id;
this.fstname = fstname;
this.lastname = lastname;
this.mobileno = mobileno;
}
public String getFstname() {
return fstname;
}
public void setFstname(String fstname) {
this.fstname = fstname;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getLastname() {
return lastname;
}
public void setLastname(String lastname) {
this.lastname = lastname;
}
public long getMobileno() {
return mobileno;
}
public void setMobileno(int mobileno) {
this.mobileno = mobileno;
}
public String toString() {
return "Hotel "
+ getId()
+": "
+ getFstname()
+" ("
+ getLastname()
+")";
}
}
Now when I search with the query
TopDocs topDocs=new SearchEngine().performSearch("fstname:vamshi AND fstname:krishna", 100);
it does not return the document with fstname "vamshi krishna".
What is the problem in my code?
This is a simple boolean AND query:
fstname:vamshi AND fstname:sai
The query parser will translate this into the query:
+fstname:vamshi +fstname:sai
Edit:
There is one problem in your code: you are using StringFields to store the hotel names. However, StringFields are indexed but not tokenized (see here), which means they are not broken down into individual tokens. If you add "vamshi krishna", it is not tokenized into "vamshi" and "krishna" but stored as the single term "vamshi krishna".
Try using a regular TextField and it should work.
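For illustration, a minimal sketch of that change against the Lucene 4.10 API used in the question (the BooleanQuery is just the programmatic equivalent of the parsed +fstname:... +fstname:... query, and it assumes an IndexSearcher named searcher as in the SearchEngine class):
// In indexHotel: use a TextField so the name is tokenized into "vamshi" and "krishna"
doc.add(new TextField("fstname", hotel.getFstname(), Field.Store.YES));

// Programmatic equivalent of "fstname:vamshi AND fstname:krishna"
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("fstname", "vamshi")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("fstname", "krishna")), BooleanClause.Occur.MUST);
TopDocs topDocs = searcher.search(query, 100);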
Try using these:
fstname:vamshi*sai OR fstname:sai*vamshi
As you can see, this searches for a text pattern, which will probably have performance issues.
Try looking here for more information.

Full-text search on Neo4j over rich text with HTML markup

In my Neo4j application I have a Product entity with name and description fields. Both of these fields are used in legacy indexing over Lucene.
Product.name is simple text and there are no issues there, but Product.description can contain HTML markup and elements.
Right now I use StandardAnalyzer(Version.LUCENE_36) for my index. What analyzer should I use in order to skip all HTML elements?
How do I tell the Neo4j Lucene index not to index any HTML elements in Product.description? I'd like to index only words.
UPDATED:
I have found the HTMLStripCharFilter class and reimplemented my Analyzer as follows:
public final class StandardAnalyzerV36 extends Analyzer {
private Analyzer analyzer;
public StandardAnalyzerV36() {
analyzer = new StandardAnalyzer(Version.LUCENE_36);
}
public StandardAnalyzerV36(Set<?> stopWords) {
analyzer = new StandardAnalyzer(Version.LUCENE_36, stopWords);
}
@Override
public final TokenStream tokenStream(String fieldName, Reader reader) {
return analyzer.tokenStream(fieldName, new HTMLStripCharFilter(CharReader.get(reader)));
}
@Override
public final TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
return analyzer.reusableTokenStream(fieldName, reader);
}
}
I have also added a new Maven dependency to my Neo4j project:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers</artifactId>
<version>3.6.2</version>
</dependency>
Everything works fine right now, but I'm not sure that the method
@Override
public final TokenStream tokenStream(String fieldName, Reader reader) {
return analyzer.tokenStream(fieldName, new HTMLStripCharFilter(CharReader.get(reader)));
}
is a proper place for HTMLStripCharFilter initialization.
Please correct me if I'm wrong.
I have added the following init method:
@PostConstruct
public void init() {
GraphDatabaseService graphDb = template.getGraphDatabaseService();
try (Transaction t = graphDb.beginTx()) {
Index<Node> autoIndex = graphDb.index().forNodes("node_auto_index");
graphDb.index().setConfiguration(autoIndex, "type", "fulltext");
graphDb.index().setConfiguration(autoIndex, "to_lower_case", "true");
graphDb.index().setConfiguration(autoIndex, "analyzer", StandardAnalyzerV36.class.getName());
t.success();
}
}
and created the following class:
public final class StandardAnalyzerV36 extends Analyzer {
private Analyzer analyzer;
public StandardAnalyzerV36() {
analyzer = new StandardAnalyzer(Version.LUCENE_36);
}
public StandardAnalyzerV36(Set<?> stopWords) {
analyzer = new StandardAnalyzer(Version.LUCENE_36, stopWords);
}
@Override
public final TokenStream tokenStream(String fieldName, Reader reader) {
return analyzer.tokenStream(fieldName, new HTMLStripCharFilter(CharReader.get(reader)));
}
@Override
public final TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
return analyzer.reusableTokenStream(fieldName, reader);
}
}
Now everything works as expected. I hope it will help someone else. Good luck.
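One small caveat about the analyzer above, offered as a suggestion rather than a confirmed fix: reusableTokenStream delegates to the wrapped StandardAnalyzer without the HTMLStripCharFilter, so a reused stream might bypass the HTML stripping. A conservative option is to route it through tokenStream as well:
@Override
public final TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    // always go through tokenStream so the HTMLStripCharFilter is applied
    return tokenStream(fieldName, reader);
}
This trades the stream-reuse optimization for consistent HTML stripping.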