Alfresco: Lucene query by ID returns 2 rows - lucene

I'm using Alfresco 3.4d and imported some nodes as well as created a few with NodeService. Today I noticed that a Lucene query by ID does sometimes return two rows instead of just one. Not all nodes show this kind of behavior.
For example, when I execute the following Lucene query in the Alfresco Node Browser, I get the result shown below: ID:"workspace://SpacesStore/96c0cc27-cb8c-49cf-977d-a966e5c5e9ca"
How is it even possible that a query by ID can return more than one row? I tried rebuilding the Lucene index, but it didn't help. When I delete the node, the query returns 0 rows. What can I do to remove those "ghost" nodes from the query result?

I also ran across this problem and asked the Alfresco support for advice. They told me that it is perfectly normal to have duplicate entries in the lucene ID field and that this is related to whether there is an ANCESTOR present or not. They recommended using the sys:node-uuid field when doing a lucene search for the node's ID, e.g.:
#sys\:node-uuid:f13a21dd-b020-4c70-aa21-1a0e5c89d42b

I've seen this problem since Alfresco 3.2r, but maybe it is even older! I used the Lucene index Viewer "Luke" (http://www.getopt.org/luke/) to check the index directly and I saw that the corrupt index entry contains almost no information. As workaround we combined our search to some basic information like node type or aspect. I will ask a colleague if he has more information about this.

I don't know directly how this is possible but in your 'code' where you retrieve the nodes you could always do: if node.isDocument or node.isContainer to get true result or type is cm:content or cm:folder.
You could also try to re-index, but I doubt that will be of any help

Related

Why is typesense returning only few of the matches in the index?

I have typesense server running and working. However I have found strange behaviour. There are documents in index with string column called "product_number". And there are product numbers in format "BKP001", "BKP002", .... "BKP999". There is 60 such documents.
However, when I query typesense and search for "BKP", it finds only 4 random matching documents. Strange is that if I do more specific search for "BKP01", it returns 4 documents "BKP010", "BKP011", "BKP012" and "BKP013".
And when I search for "BKP03" then it returns 4 documents "BKP031", "BKP032", "BKP033" and "BKP034". So it is clear that all the documents are correctly in index.
What could be the reason why it doesn't find all the documents ?
When there are several possible prefix matches for a particular keyword, Typesense limits the number of results picked for performance reasons, until more of the keyword is typed. If you want all results to be returned, you want to add max_candidates=1000 and exhaustive_search=true as search parameters.
You also want to make sure that you're using the latest version of Typesense as relevance improvements are regularly released with each version.

Ldap search for objects where attribute X contains multiple values

I would like to know if it is possible to do a search like this:
"give me all objects where description has more than 1 value"
The short answer is no. At least not from a single LDAP Query without somehow parsing the results.
I know of a tool that will provide those results however it has not been updated in a while but last time I used it, it worked.

Apache Solr 5 - deduplicating data within a field

Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field and this field contains data that only has a few different variations in the text across all 10 million documents. This field is large-ish -400-800 words or so.
What is the most appropriate way to eliminate this repetition of data in the 'description' field?
Let me elaborate. Here is an example schema that been simplified:
Doc_id <-- this is unique
Title <-- always unique as well
Description <-- contains mostly dupe data
I search over both the title and description but only return the title itself.
I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.
Thanks for any help!
I will try to provide some strategies to tackle your problem.
You are saying that you search over title and description, this means you should set these fields to indexed=true in your schema.xml. Only title is returned, this means only title needs to be set to stored=true, description should be set to stored=false. See this posting for more information on stored vs. indexed: Solr index vs stored
Another useful option you could try is the field option compression. If you need to store a field, you can use gzip compression on certain fields, such as TextField and StrField, see: https://wiki.apache.org/solr/SchemaXml for more info.
Lastly, deduplication is supported in Solr, see: https://wiki.apache.org/solr/Deduplication. I did not try this feature, but from the sounds of it, you can prevent (nearly) duplicate documents to be indexed or tag duplicates. Maybe its goal "Allow for both duplicate collapsing in search results as well as deduplication on adding a document." is what you are looking for?

Hibernate Search - possible to get new Lucene query after facets applied?

A Lucene Query is generated as so:
Query luceneQuery = builder.all().createQuery();
Then facets are applied.
I'm not sure if when facets are applied the luceneQuery is ANDed and ORed with other Querys resulting in a new Lucene Query. Alternatively, perhaps a bunch of BitSets's are applied to the original Query to refine the results. (I don't know).
If a new query is generated I'd like to retrieve it. If not, I need a rethink. That's the crux of the question.
Why:
I'm applying a faceted search on a field with multiple possible values.
E.g. TMovie.class many-to-many TTag.class (multiple-value-facet)
I'm filtering on TMovie where TTag is some value.
Anyway, the filtering works but there is a known problem whereby the Facet-counts returned are incorrect.
Detailed here: Add faceting over multivalued to application using Hibernate Search and https://forum.hibernate.org/viewtopic.php?f=9&t=1010472
I'm using this solution:
http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html (see comment on new API under article)
The BitSet solution (in this example at least) generates counts based on the original Lucene Query. This works perfectly. However.....
If alternate (different, not TTags) facets are applied to the original query some complications arise.
The Bitset solution calculates on the original Lucene query. It does not calculate on the lucene query now reduced by the application of alternate Facets (a different FacetSelection) (or even TTag Facets themselves for that matter). I.e. the count calculations are irrespective of any other FacetSelection Facets applied.
So...
A. can I get the new Lucene query after facets are applied? The BitSet solution applied to this would be correct.
B. Any other alternative suggestions?
Thanks so much.. All comments welcome.
John
Regarding your first question, applying a facet is not modifying the original query, it uses a custom Collector called FacetCollector - see https://github.com/hibernate/hibernate-search/blob/master/engine/src/main/java/org/hibernate/search/query/collector/impl/FacetCollector.java. Under the hood the collector uses a Lucene FieldCache for doing the facet count. There is also the root of the limitation for multi-value faceting. FieldCache does not support multiple values per field.
Anyways, no additional queries are applied during faceting and the original query is unmodified. The benefit of course is performance. The solution you are pointing to probably works as well, but relies on running multiple queries. However, it might be a valid work around for your use case.

Why are my Lucene Document results empty?

I'm running a simple test--trying to index something and then search for it. I index a simple document, but then when a search for a string in it, I get back what looks to be an empty document (it has no fields). Lucene seems to be doing something, because if I search for a word that's not in the document, it returns 0 results.
Any reason why Lucene would reliably return a document when it finds one that matches the given query, and yet that document has nothing in it?
More details:
I'm actually running Lucandra (Lucene + Cassandra). That certainly may be a relevant detail, but not sure.
The fields are set to Field.Store/YES and Field.Index/ANALYZED
Interestingly, I'm able to get this to work just fine on my local machine, but when we put it on our main server (which is a multi-node cassandra setup), I get the behavior described above. So this seems like probably the relevant detail, but unfortunately, I see no error message to clue me in to what specifically is causing it.
Unsure if this will work with Lucandra, but you have tried opening the index using Luke? Viewing the index contents with Luke might help
It's hard to tell what the problem is since you only provide a very abstract description. However, it sounds a bit like you are not storing the field value in the index. There are different modes for indexing a field. One option determines whether the original value is stored in the index to retrieve it later:
http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/document/Field.Store.html
See also the description of the enclosing class Field
Read: http://anismiles.wordpress.com/2010/05/27/lucandra-an-inside-story/