Why are my Lucene Document results empty? - lucene

I'm running a simple test: trying to index something and then search for it. I index a simple document, but when I search for a string in it, I get back what looks to be an empty document (it has no fields). Lucene seems to be doing something, because if I search for a word that's not in the document, it returns 0 results.
Any reason why Lucene would reliably return a document when it finds one that matches the given query, and yet that document has nothing in it?
More details:
I'm actually running Lucandra (Lucene + Cassandra). That certainly may be a relevant detail, but not sure.
The fields are set to Field.Store.YES and Field.Index.ANALYZED.
Interestingly, I'm able to get this to work just fine on my local machine, but when we put it on our main server (which is a multi-node cassandra setup), I get the behavior described above. So this seems like probably the relevant detail, but unfortunately, I see no error message to clue me in to what specifically is causing it.

Unsure if this will work with Lucandra, but have you tried opening the index using Luke? Viewing the index contents with Luke might help.

It's hard to tell what the problem is since you only provide a very abstract description. However, it sounds a bit like you are not storing the field value in the index. There are different modes for indexing a field. One option determines whether the original value is stored in the index to retrieve it later:
http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/document/Field.Store.html
See also the description of the enclosing class Field

Read: http://anismiles.wordpress.com/2010/05/27/lucandra-an-inside-story/

Related

How can I get document hash from documentum using DQL?

Using DQMan or Document Administrator, what's the DQL statement to get the hash of a document in DCTM?
Select ... ?
If it's not possible how can I get it?
(I know exactly which is the document, r_chronicle_id, r_object_id, etc...)
AFAIK there is no field representing a document hash, but take a look in the dmr_content object table. It should be there if there is one (I haven't checked in several years).
Alternatively you would have to get it with API - either there is a method or you should do it yourself. Take a look in the api guide.
Edit: just searched the object reference guide. Turns out that there is a field in dmr_content. It's called r_content_hash.
Have a look at it to see if your docbase fulfills requirements to have this field populated. Maybe you're in luck ;-)

Lucene Difference between OpenMode.CREATE_OR_APPEND and deleteDocuments

I am pretty new to the Lucene search engine and want to know what OpenMode.CREATE_OR_APPEND and deleteDocuments do. Also, the indexSearcher.search method can accept either a Term or a Query as a parameter to fetch documents. Can you help me understand in which scenario I need to use a Term and in which a Query?
The OpenMode does not affect the behavior of deleteDocuments. It only affects what happens when you open the IndexWriter:
CREATE - Creates a new index. If one already exists, it will be overwritten.
CREATE_OR_APPEND - Uses an existing index, or creates it if none currently exists.
APPEND - Uses an existing index. If none currently exists, throws an IOException.
I'm not aware of any IndexSearcher.search method that takes a Term as an argument. If you can link to what you are referring to, that might be helpful.
However, if you want to search for a term, you can just use TermQuery.

SQL like '%term%' except without letters

I'm searching against a table of news articles. The 2 relevant columns are ArticleTitle and ArticleText. When I want to search an article for a particular term, I started out with
column LIKE '%term%'.
However, that gave me a lot of articles with the term inside anchor links, for example <a href="example.com/*term*">, which would potentially return an irrelevant article.
So then I switched to
column LIKE '% term %'.
The problem with this query is that it didn't find articles whose title or text began/ended with the term. It also didn't match things like term- or term's, which I do want.
It seems like the query I want should be able to do something like this
'%[^a-z]term[^a-z]%'
This should exclude terms within anchor links but match everything else. I think this query still excludes strings that begin/end with the term. Is there a better solution? Does SQL Server's full-text indexing solve this problem?
Additionally, would it be a good idea to store ArticleTitle and ArticleText as HTML-free columns? Then I could use '%term%' without getting anchor links. These would be 2 extra columns though, because eventually I will need the original HTML for formatting purposes.
Thanks.
SQL Server's LIKE allows you to define Regex-like patterns like you described.
A better option is to use fulltext search:
WHERE CONTAINS(ArticleTitle, 'term')
exploits the index properly (the LIKE '%term%' query is slow) and provides other benefits in the search algorithm.
Additionally, you might benefit from storing a plaintext version of the article alongside the HTML version, and run your search queries on it.
SQL is not designed to interpret HTML strings. As such, you'd only be able to postpone the problem till a more difficult issue arrives (for example, a comment node that contains your search terms as part of a plain sentence).
You can still utilize FULL TEXT as a prefilter and then run an HTML analysis on the application layer to further filter your result set.
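As a rough illustration of the application-layer filtering mentioned above, here is a minimal Java sketch. It assumes the candidate rows have already been fetched (for example via the full-text prefilter), and the class and method names (TermMatcher, stripTags, containsTerm) are made up for this example. The tag stripping is a naive regex, not a real HTML parser, so treat it as a starting point only. The lookarounds play the role of [^a-z] in the question, but they also match at the very start or end of the text.

```java
import java.util.regex.Pattern;

final class TermMatcher {
    // Naive tag stripper (assumption: good enough for simple markup).
    // Anything between '<' and '>' is dropped, including href values.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        // Replace tags with a space so neighbouring words do not fuse.
        return TAG.matcher(html).replaceAll(" ");
    }

    // True if the term occurs in the visible text without a letter
    // directly before or after it. Unlike '% term %', this also matches
    // at the start or end of the text and before punctuation.
    static boolean containsTerm(String html, String term) {
        Pattern p = Pattern.compile(
                "(?<![a-zA-Z])" + Pattern.quote(term) + "(?![a-zA-Z])",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(stripTags(html)).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("<a href=\"example.com/term\">link</a>", "term"));
        System.out.println(containsTerm("term's value", "term"));
        System.out.println(containsTerm("determine", "term"));
    }
}
```

A real HTML parser would be sturdier than the regex here, which is exactly the kind of harder case the answer warns about.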

Alfresco: Lucene query by ID returns 2 rows

I'm using Alfresco 3.4d and imported some nodes as well as created a few with NodeService. Today I noticed that a Lucene query by ID does sometimes return two rows instead of just one. Not all nodes show this kind of behavior.
For example, when I execute the following Lucene query in the Alfresco Node Browser, I get the result shown below: ID:"workspace://SpacesStore/96c0cc27-cb8c-49cf-977d-a966e5c5e9ca"
How is it even possible that a query by ID can return more than one row? I tried rebuilding the Lucene index, but it didn't help. When I delete the node, the query returns 0 rows. What can I do to remove those "ghost" nodes from the query result?
I also ran across this problem and asked the Alfresco support for advice. They told me that it is perfectly normal to have duplicate entries in the lucene ID field and that this is related to whether there is an ANCESTOR present or not. They recommended using the sys:node-uuid field when doing a lucene search for the node's ID, e.g.:
#sys\:node-uuid:f13a21dd-b020-4c70-aa21-1a0e5c89d42b
I've seen this problem since Alfresco 3.2r, but maybe it is even older! I used the Lucene index viewer "Luke" (http://www.getopt.org/luke/) to check the index directly, and I saw that the corrupt index entry contains almost no information. As a workaround, we combined our search with some basic information like node type or aspect. I will ask a colleague if he has more information about this.
I don't know exactly how this is possible, but in the code where you retrieve the nodes you could always check node.isDocument or node.isContainer, or check that the type is cm:content or cm:folder, to keep only the real results.
You could also try to re-index, but I doubt that will be of any help.

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net, and the resources I found suggest increasing the limit with BooleanQuery.SetMaxClauseCount().
This sounds fishy to me.. To what should I up it? How can I rely that this new magic number will be sufficient for my query? How far can I increment this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
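To make the expansion concrete, here is a small plain-Java simulation. It does not use Lucene at all; it only mimics the idea that every indexed term matching the prefix becomes one clause, and compares the count against the 1024 limit mentioned above. The class name, the helper methods and the generated sample terms are all made up for this sketch.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

final class PrefixExpansionDemo {
    // Clause limit as described in the answer above.
    static final int MAX_CLAUSES = 1024;

    // Hypothetical term dictionary: n sorted terms like "a0000", "a0001", ...
    static NavigableSet<String> sampleTerms(String prefix, int n) {
        NavigableSet<String> terms = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            terms.add(String.format("%s%04d", prefix, i));
        }
        return terms;
    }

    // One clause per term that starts with the prefix. Terms are sorted,
    // so all matches form one contiguous run starting at the prefix.
    static int clausesFor(NavigableSet<String> terms, String prefix) {
        int count = 0;
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            count++;
        }
        return count;
    }

    static boolean exceedsLimit(NavigableSet<String> terms, String prefix) {
        return clausesFor(terms, prefix) > MAX_CLAUSES;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = sampleTerms("a", 2000);
        terms.addAll(sampleTerms("b", 10));
        System.out.println("a* -> " + clausesFor(terms, "a") + " clauses");
        System.out.println("b* -> " + clausesFor(terms, "b") + " clauses");
        System.out.println("a* over limit: " + exceedsLimit(terms, "a"));
    }
}
```

The point is that the clause count depends on how many distinct terms the index holds for that prefix, not on how many documents match, so no fixed limit is safe as the index grows.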
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
    query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
    break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in your term decides which field to search on. This means that you will perform a prefix query only for terms with more than 3 characters, which greatly decreases the internal result count, preventing the infamous TooManyClauses exception. Apparently this also speeds up the searching process.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article linked above.
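The naming scheme above could be automated along these lines. This is only a sketch in plain Java with no Lucene calls: it produces the field name/value pairs you would add to the document at indexing time, and picks the field to search at query time. The class and method names and the cutoff constant are assumptions for this example; the field names follow the paintCode1n/2n/3n pattern from the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PaintCodePrefixFields {
    // Longest prefix that gets its own field (3 in the example above).
    static final int MAX_PREFIX_LEN = 3;
    static final String BASE = "paintCode";

    // Field name -> value pairs to add to the document when indexing.
    static Map<String, String> indexFields(String code) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(BASE, code);
        int limit = Math.min(MAX_PREFIX_LEN, code.length());
        for (int n = 1; n <= limit; n++) {
            fields.put(BASE + n + "n", code.substring(0, n));
        }
        return fields;
    }

    // Field to search for a user-supplied prefix. Short prefixes become an
    // exact match on a prefix field; longer ones still need a prefix query
    // on the full field, but by then far fewer terms match.
    static String fieldForPrefix(String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        return prefix.length() <= MAX_PREFIX_LEN
                ? BASE + prefix.length() + "n"
                : BASE;
    }

    public static void main(String[] args) {
        System.out.println(indexFields("a4c2d3"));
        System.out.println(fieldForPrefix("a"));
        System.out.println(fieldForPrefix("a4c2"));
    }
}
```

With this in place, a search for a* would become an exact term match for "a" on paintCode1n, so no expansion happens at all.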