Can't get search for modification time to work - Lucene

The example in package org.apache.lucene.demo works for text search.
But I can't get it to search on or display the modification time.
The modified field seems to be indexed, but I have had no success using it.
Running SearchFiles prints hits for
Enter query:
+kompl*
but nothing for
+kompl* +modified:[0 TO 9999999999999]
Can someone provide an example for this?

I had wrongly assumed that file attributes would somehow be implicitly available to me.
But OK, I had to do it myself.
For indexing I added a simple integer field:
// store the date as an integer of the form yyyyMMdd so it can be queried as a range
Date dt = new Date(lastModified);
int myDays = (dt.getYear() + 1900) * 10000 + (dt.getMonth() + 1) * 100 + dt.getDate();
doc.add(new IntPoint("moddate", myDays));       // indexed for range queries
doc.add(new StoredField("moddateVal", myDays)); // stored so the value can be displayed
For searching I handle this field in an extended query parser:
public static class QueryParserModdate extends QueryParser {

    public QueryParserModdate(String f, Analyzer a) {
        super(f, a);
    }

    @Override
    protected Query getRangeQuery(String field, String part1, String part2,
            boolean startInclusive, boolean endInclusive) throws ParseException {
        if (field.equalsIgnoreCase("moddate")) {
            // fall back to an open-ended bound if a value does not parse
            int part1Int = Integer.MIN_VALUE;
            int part2Int = Integer.MAX_VALUE;
            try {
                part1Int = Integer.parseInt(part1);
                part2Int = Integer.parseInt(part2);
            } catch (Exception e) {
                // ignore: keep the open-ended defaults
            }
            return IntPoint.newRangeQuery("moddate", part1Int, part2Int);
        }
        return super.getRangeQuery(field, part1, part2, startInclusive, endInclusive);
    }
}
Certainly not beautiful, but it works for me.
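For completeness, this is roughly how the parser is used at search time (a minimal sketch; the default field name, the analyzer, the date values, and the open IndexSearcher named searcher are just examples):
QueryParser parser = new QueryParserModdate("contents", new StandardAnalyzer());
// match terms starting with "kompl" that were modified during 2021
Query q = parser.parse("+kompl* +moddate:[20210101 TO 20211231]");
TopDocs hits = searcher.search(q, 10);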


I have synonym matching working EXCEPT in quoted phrases

Simple synonyms (wordA = wordB) are fine. When there are two or more synonyms (wordA = wordB = wordC ...), phrase matching only works for the first, unless the phrases have proximity modifiers.
I have a simple test case (it's delivered as an Ant project) which illustrates the problem.
Materials
You can download the test case here: mydemo.with.libs.zip (5MB)
That archive includes the Lucene 9.2 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo.zip (9KB)
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant rnsearch
Input
When indexing the documents, the following synonym list is used (permuted as necessary):
note,notes,notice,notification
subtree,sub tree,sub-tree
I have three documents, each containing a single sentence. The three sentences are:
These release notes describe a document sub tree in a simple way.
This release note describes a document subtree in a simple way.
This release notice describes a document sub-tree in a simple way.
Problem
I believe that any of the following searches should match all three documents:
release note
release notes
release notice
release notification
"release note"
"release notes"
"release notice"
"release notification"
As it happens, the first four searches are fine, but the quoted phrases demonstrate a problem.
The searches for "release note" and "release notes" match all three records, but "release notice" only matches one, and "release notification" does not match any.
However if I change the last two searches like so:
"release notice"~1
"release notification"~2
then all three documents match.
What appears to be happening is that the first synonym is being given the same index position as the term, the second synonym has the position offset by 1, the third offset by 2, etc.
I believe that all the synonyms should be given the same position so that all four phrases match without the need for proximity modifiers at all.
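One way to verify where the analyzer actually places each token is to run it by hand and accumulate the position increments (a small diagnostic sketch; the analyzer instance and the literal input text are just examples):
try (TokenStream ts = analyzer.tokenStream("body", "This release note describes a document subtree in a simple way.")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    int pos = -1;
    while (ts.incrementToken()) {
        pos += posIncr.getPositionIncrement(); // an increment of 0 means "same position as the previous token"
        System.out.println(term + " @ " + pos);
    }
    ts.end();
}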
Edit: here's the source of my analyzer:
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer(String synlist) {
        this.synlist = synlist;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new LowerCaseFilter(src);
        if (synlist != null) {
            result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }
    private static SynonymMap getSynonyms(String synlist) {
        boolean dedup = Boolean.TRUE;
        SynonymMap synMap = null;
        SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
        int cnt = 0;
        try {
            BufferedReader br = new BufferedReader(new FileReader(synlist));
            String line;
            try {
                while ((line = br.readLine()) != null) {
                    processLine(builder, line);
                    cnt++;
                }
            } catch (IOException e) {
                System.err.println(" caught " + e.getClass() + " while reading synonym list,\n with message " + e.getMessage());
            }
            System.out.println("Synonym load processed " + cnt + " lines");
            br.close();
        } catch (Exception e) {
            System.err.println(" caught " + e.getClass() + " while loading synonym map,\n with message " + e.getMessage());
        }
        if (cnt > 0) {
            try {
                synMap = builder.build();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
        return synMap;
    }
    private static void processLine(SynonymMap.Builder builder, String line) {
        boolean keepOrig = Boolean.TRUE;
        String[] terms = line.split(",");
        if (terms.length < 2) {
            System.err.println("Synonym input must have at least two terms on a line: " + line);
        } else {
            String word = terms[0];
            String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
            addSyns(builder, word, synonymsOfWord, keepOrig);
        }
    }

    private static void addSyns(SynonymMap.Builder builder, String word, String[] syns, boolean keepOrig) {
        CharsRefBuilder synset = new CharsRefBuilder();
        SynonymMap.Builder.join(syns, synset);
        CharsRef wordp = SynonymMap.Builder.join(word.split("\\s+"), new CharsRefBuilder());
        builder.add(wordp, synset.get(), keepOrig);
    }

    private String synlist;
}
The analyzer includes synonyms when it builds the index, and does not add synonyms when it is used to process a query.
For the "note", "notes", "notice", "notification" list of synonyms:
It is possible to build an index of the above synonyms so that every query listed in the question will find all three documents - including the phrase searches without the need for any ~n proximity searches.
I see there is a separate question for the other list of synonyms "subtree", "sub tree", "sub-tree" - so I will skip those here (I expect the below approach will not work for those, but I would have to take a closer look).
The solution is straightforward, and it's based on a realization that I was (in an earlier question) completely incorrect in an assumption I made about how to build the synonyms:
You can place multiple synonyms of a given word at the same position as the word, when building your indexed data. I incorrectly thought you needed to provide the synonyms as a list - but you can provide them one at a time as words.
Here is the approach:
My analyzer:
Analyzer analyzer = new Analyzer() {
    @Override
    protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream tokenStream = source;
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        // ignoreSynonymCase is a boolean flag defined elsewhere
        tokenStream = new SynonymGraphFilter(tokenStream, getSynonyms(), ignoreSynonymCase);
        tokenStream = new FlattenGraphFilter(tokenStream);
        return new Analyzer.TokenStreamComponents(source, tokenStream);
    }
};
The getSynonyms() method used by the above analyzer, using the note,notes,notice,notification list:
private SynonymMap getSynonyms() {
    // de-duplicate rules when loading:
    boolean dedup = Boolean.TRUE;
    // include original word in index:
    boolean includeOrig = Boolean.TRUE;
    String[] synonyms = {"note", "notes", "notice", "notification"};
    // build a synonym map where every word in the list is a synonym
    // of every other word in the list:
    SynonymMap.Builder synMapBuilder = new SynonymMap.Builder(dedup);
    for (String word : synonyms) {
        for (String synonym : synonyms) {
            if (!synonym.equals(word)) {
                synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig);
            }
        }
    }
    SynonymMap synonymMap = null;
    try {
        synonymMap = synMapBuilder.build();
    } catch (IOException ex) {
        System.err.print(ex);
    }
    return synonymMap;
}
I looked at the indexed data by using org.apache.lucene.codecs.simpletext.SimpleTextCodec, to generate human-readable indexes (just for testing purposes):
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
This allowed me to see where the synonyms were inserted into the indexed data. So, for example, taking the word note, we see the following indexed entries:
term note
  doc 0
    freq 1
    pos 2
  doc 1
    freq 1
    pos 2
  doc 2
    freq 1
    pos 2
So, that tells us that all three documents contain note at token position 2 (the 3rd word).
And for notification we see exactly the same data:
term notification
  doc 0
    freq 1
    pos 2
  doc 1
    freq 1
    pos 2
  doc 2
    freq 1
    pos 2
We see this for all the words in the synonym list, which is why all 8 queries return all 3 documents.
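As a quick programmatic check, the strictest of those queries can be run as a phrase with no slop (a sketch; the field name "body" and an open IndexSearcher named searcher are assumptions):
PhraseQuery phrase = new PhraseQuery("body", "release", "notification");
TopDocs hits = searcher.search(phrase, 10);
System.out.println(hits.totalHits); // all three documents match, no ~n slop required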

Data is written to BigQuery but not in the proper format

I'm writing data to BigQuery and it gets written there successfully. But I'm concerned about the format in which it is written.
Below is the format in which the data is shown when I execute any query in BigQuery:
Check the first row: the value of SalesComponent should be CPS_H, but it shows 'BeamRecord [dataValues=[CPS_H', and the ModelIteration value ends with a square bracket.
Below is the code that is used to push data to BigQuery from BeamSql:
TableSchema tableSchema = new TableSchema().setFields(ImmutableList.of(
    new TableFieldSchema().setName("SalesComponent").setType("STRING").setMode("REQUIRED"),
    new TableFieldSchema().setName("DuetoValue").setType("STRING").setMode("REQUIRED"),
    new TableFieldSchema().setName("ModelIteration").setType("STRING").setMode("REQUIRED")
));

TableReference tableSpec = BigQueryHelpers.parseTableSpec("beta-194409:data_id1.tables_test");
System.out.println("Start Bigquery");

final_out.apply(MapElements.into(TypeDescriptor.of(TableRow.class)).via(
        (MyOutputClass elem) -> new TableRow()
            .set("SalesComponent", elem.SalesComponent)
            .set("DuetoValue", elem.DuetoValue)
            .set("ModelIteration", elem.ModelIteration)))
    .apply(BigQueryIO.writeTableRows()
        .to(tableSpec)
        .withSchema(tableSchema)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

p.run().waitUntilFinish();
EDIT
I have transformed BeamRecord into MyOutputClass using the code below, and this also doesn't work:
PCollection<MyOutputClass> final_out = join_query.apply(ParDo.of(new DoFn<BeamRecord, MyOutputClass>() {
    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) {
        BeamRecord record = c.element();
        String[] strArr = record.toString().split(",");
        MyOutputClass moc = new MyOutputClass();
        moc.setSalesComponent(strArr[0]);
        moc.setDuetoValue(strArr[1]);
        moc.setModelIteration(strArr[2]);
        c.output(moc);
    }
}));
It looks like your MyOutputClass is constructed incorrectly (with incorrect values). If you look at it, BigQueryIO is able to create rows with the correct fields just fine, but those fields have wrong values. Which means that when you call .set("SalesComponent", elem.SalesComponent) you already have incorrect data in elem.
My guess is the problem is in some previous step, when you convert from BeamRecord to MyOutputClass. You would get a result similar to what you're seeing if you did something like this (or some other conversion logic did this for you behind the scenes):
convert BeamRecord to string by calling beamRecord.toString();
if you look at BeamRecord.toString() implementation you can see that you're getting exactly that string format;
split this string by , getting an array of strings;
construct MyOutputClass from that array;
Pseudocode for this is something like:
PCollection<MyOutputClass> final_out =
    beamRecords
        .apply(
            ParDo.of(new DoFn() {
                @ProcessElement
                void processElement(Context c) {
                    BeamRecord record = c.elem();
                    String[] fields = record.toString().split(",");
                    MyOutputClass elem = new MyOutputClass();
                    elem.SalesComponent = fields[0];
                    elem.DuetoValue = fields[1];
                    ...
                    c.output(elem);
                }
            })
        );
Correct way of doing something like this is to call getters on the record instead of splitting its string representation, along these lines (pseudocode):
PCollection<MyOutputClass> final_out =
    beamRecords
        .apply(
            ParDo.of(new DoFn() {
                @ProcessElement
                void processElement(Context c) {
                    BeamRecord record = c.elem();
                    MyOutputClass elem = new MyOutputClass();
                    // get each field value by its name from the schema
                    elem.SalesComponent = record.getString("SalesComponent");
                    elem.DuetoValue = record.getString("DuetoValue");
                    ...
                    c.output(elem);
                }
            })
        );
You can verify something like this by adding a simple ParDo where you either put a breakpoint and look at the elements in the debugger, or output the elements somewhere else (e.g. console).
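For instance, a pass-through ParDo that prints each element (a sketch, assuming the element type is MyOutputClass as above):
final_out.apply(ParDo.of(new DoFn<MyOutputClass, MyOutputClass>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // dump the element to the console, then pass it through unchanged
        System.out.println(c.element().SalesComponent + ", " + c.element().DuetoValue);
        c.output(c.element());
    }
}));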
I was able to resolve this issue using the method below:
PCollection<MyOutputClass> final_out = record40.apply(ParDo.of(new DoFn<BeamRecord, MyOutputClass>() {
    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) throws ParseException {
        BeamRecord record = c.element();
        String strArr = record.toString();
        // strip the 24-character "BeamRecord [dataValues=[" prefix and the closing brackets
        String strArr1 = strArr.substring(24);
        String xyz = strArr1.replace("]", "");
        String[] strArr2 = xyz.split(",");
        // set the cleaned values as before
        MyOutputClass moc = new MyOutputClass();
        moc.setSalesComponent(strArr2[0]);
        moc.setDuetoValue(strArr2[1]);
        moc.setModelIteration(strArr2[2]);
        c.output(moc);
    }
}));

Lucene 4.x user-defined Filter: the DocsEnum in getDocIdSet(...) returns null and fails to work

I'm using Lucene 4.0 to build my search engine, and I need to define a Filter when searching. Filter code like this works fine:
public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
        throws IOException {
    String[] target_real_names = {"eMule"};
    OpenBitSet obs = new OpenBitSet(context.reader().maxDoc());
    for (String target_real_name : target_real_names) {
        TermQuery query = new TermQuery(new Term(Fields.PROJECT_REAL_NAME, target_real_name));
        IndexSearcher indexSearcher = new IndexSearcher(context.reader());
        TopDocs docs = indexSearcher.search(query, context.reader().maxDoc());
        ScoreDoc[] scoreDocs = docs.scoreDocs;
        if (scoreDocs.length == 1) {
            obs.set(scoreDocs[0].doc);
        }
    }
    return obs;
}
but code like this fails to work:
public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
        throws IOException {
    OpenBitSet obs = new OpenBitSet(context.reader().maxDoc());
    String[] target_real_names = {"eMule"};
    for (String target_real_name : target_real_names) {
        DocsEnum de = context.reader().termDocsEnum(new Term(Fields.PROJECT_REAL_NAME, target_real_name));
        if (de.nextDoc() != -1) {
            obs.set((long) de.docID());
        }
    }
    return obs;
}
In this piece of code, de is null and I don't know why. Can anyone help me?
Look at the javadoc of termDocsEnum(): it will return null if either the field or the term does not exist.
This means it's completely normal for de to be null when your term target_real_name does not exist.
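So guard against the null before iterating; note also that nextDoc() signals exhaustion with DocIdSetIterator.NO_MORE_DOCS rather than -1 (a minimal sketch of the loop body):
DocsEnum de = context.reader().termDocsEnum(new Term(Fields.PROJECT_REAL_NAME, target_real_name));
if (de != null) {
    // iterate every matching doc instead of only the first
    int docId;
    while ((docId = de.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        obs.set(docId);
    }
}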

GWT HTML5 database SELECT WHERE AND

I am using http://code.google.com/p/gwt-mobile-webkit/downloads/list?q=label:API-Database
for my project.
There are two textboxes and the user can type in two words.
Before these two words are saved in the HTML database, I check whether they are already stored in the database.
The problem is I don't know what the correct syntax should look like for
SELECT * FROM words WHERE word1=value1 AND word2=value2.
final String word1 = view.getWord1TextBox().getValue();
final String word2 = view.getWord2TextBox().getValue();
if (Database.isSupported()) {
    Database db = Database.openDatabase("Words", "1.0", "Word App", 10000);
    db.transaction(new TransactionCallback() {
        public void onTransactionStart(SQLTransaction tx) {
            tx.executeSql("SELECT * FROM words WHERE word1=(?) AND word2=(?)", new Object[]{word1, word2},
                    new StatementCallback<JavaScriptObject>() {
                public void onSuccess(SQLTransaction transaction, SQLResultSet<JavaScriptObject> resultSet) {
                    System.out.println(" count of rows : " + resultSet.getRowsAffected());
                }
                public boolean onFailure(SQLTransaction transaction, SQLError error) {
                    return false;
                }
            });
        }
        public void onTransactionSuccess() {
        }
        public void onTransactionFailure(SQLError error) {
            //setTextMessageLabel("Could not open database", label );
        }
    });
}
resultSet.getRowsAffected() is 0 every time.
What should the correct query look like?
Or is there another problem?
Please help.
How do you know your insert worked?
Does "SELECT * FROM words" return anything?
I use http://code.google.com/p/gwt-mobile-webkit/wiki/DataServiceUserGuide so my syntax is a little different, but using the developer tools in Chrome you can find the database and run SQLite commands against it.
Doing that, this gives me an error:
SELECT * from lesson where desc = (summer)
while using quotes works:
SELECT * from lesson where desc = ("summer")
SELECT * from lesson where desc = "summer"
You may need to escape it in your code...
"SELECT * FROM words WHERE word1=(\"?\") AND word2=(\"?\")"

How to handle null pointer exceptions in Elasticsearch

I'm using Elasticsearch and I'm trying to handle the case when the database is empty.
@SuppressWarnings("unchecked")
public <M extends Model> SearchResults<M> findPage(int page, String search, String searchFields, String orderBy, String order, String where) {
    BoolQueryBuilder qb = buildQueryBuilder(search, searchFields, where);
    Query<M> query = (Query<M>) ElasticSearch.query(qb, entityClass);
    // FIXME Currently we ignore the orderBy and order fields
    query.from((page - 1) * getPageSize()).size(getPageSize());
    query.hydrate(true);
    return query.fetch();
}
The error occurs at return query.fetch();.
I'm trying to implement a try/catch statement but it's not working. Can anyone help with this, please?
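For reference, this is roughly the shape of the guard I have been attempting (a sketch only; whether SearchResults can be constructed empty like this is an assumption):
try {
    return query.fetch();
} catch (NullPointerException e) {
    // the fetch blows up when the database is empty;
    // fall back to an empty result (hypothetical empty-result construction)
    return new SearchResults<M>();
}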