Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

I am going to switch to the newest version of Lucene (4.10.2) and I'd like to make some optimizations in my index and code.
I would like to use DocValuesField to get values but also for filtering and sorting.
So here I have some questions:
If I want to use a range filter (FieldCacheRangeFilter), I need to store the value in an XxxDocValuesField,
but if I want to use a terms filter (FieldCacheTermsFilter), I need to store the value in a SortedDocValuesField.
So it looks like I need two different fields if I want to use both range and terms filters. Am I right? Am I using it correctly?
Another thing is Sort. I can choose between SortedNumericSortField and SortField. The first requires a SortedNumericDocValuesField, the second a NumericDocValuesField. Is there any (big) difference in performance?
Should I use SortedNumericSortField (adding another field to the index)?
And the last one: am I right that all corresponding doc values will be removed from the index when a document is deleted? I saw an IndexWriter method for updating a doc value, but no method for deleting one.
Regards
Piotr

Related

Hibernate Search - possible to get new Lucene query after facets applied?

A Lucene Query is generated like so:
Query luceneQuery = builder.all().createQuery();
Then facets are applied.
I'm not sure whether, when facets are applied, the luceneQuery is ANDed and ORed with other Queries, resulting in a new Lucene Query. Alternatively, perhaps a bunch of BitSets are applied to the original Query to refine the results. (I don't know.)
If a new query is generated I'd like to retrieve it. If not, I need a rethink. That's the crux of the question.
Why:
I'm applying a faceted search on a field with multiple possible values.
E.g. TMovie.class many-to-many TTag.class (multiple-value-facet)
I'm filtering on TMovie where TTag is some value.
Anyway, the filtering works but there is a known problem whereby the Facet-counts returned are incorrect.
Detailed here: Add faceting over multivalued to application using Hibernate Search and https://forum.hibernate.org/viewtopic.php?f=9&t=1010472
I'm using this solution:
http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html (see comment on new API under article)
The BitSet solution (in this example at least) generates counts based on the original Lucene Query. This works perfectly. However...
If alternate (different, not TTags) facets are applied to the original query some complications arise.
The BitSet solution calculates on the original Lucene query. It does not calculate on the Lucene query as reduced by the application of other Facets (a different FacetSelection), or even by the TTag Facets themselves for that matter. I.e. the count calculations ignore any other FacetSelection Facets applied.
So...
A. Can I get the new Lucene query after facets are applied? The BitSet solution applied to this would be correct.
B. Any other alternative suggestions?
Thanks so much.. All comments welcome.
John
Regarding your first question: applying a facet does not modify the original query. It uses a custom Collector called FacetCollector (see https://github.com/hibernate/hibernate-search/blob/master/engine/src/main/java/org/hibernate/search/query/collector/impl/FacetCollector.java). Under the hood, the collector uses a Lucene FieldCache to do the facet counting. That is also the root of the limitation for multi-value faceting: FieldCache does not support multiple values per field.
Anyway, no additional queries are applied during faceting and the original query is unmodified. The benefit, of course, is performance. The solution you are pointing to probably works as well, but relies on running multiple queries. However, it might be a valid workaround for your use case.
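If you do stay with the BitSet approach, one way to make the counts respect the other selected facets might be to AND their BitSets into the base result set before counting. Below is a minimal sketch in plain Java. It assumes you can already obtain one BitSet of matching document ids for the base query and one per facet value; the class and method names are made up for illustration and are not part of Hibernate Search or Lucene.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not a Hibernate Search or Lucene class.
final class FacetCounts {

    // Convenience for building a BitSet from document ids.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    // Narrows the base hits by every already-selected facet, then counts
    // how many of the remaining documents fall under each candidate value.
    static Map<String, Integer> count(BitSet baseHits,
                                      List<BitSet> selectedFacets,
                                      Map<String, BitSet> candidates) {
        BitSet narrowed = (BitSet) baseHits.clone();
        for (BitSet selected : selectedFacets) {
            narrowed.and(selected);
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : candidates.entrySet()) {
            BitSet overlap = (BitSet) narrowed.clone();
            overlap.and(entry.getValue());
            counts.put(entry.getKey(), overlap.cardinality());
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, BitSet> tags = new LinkedHashMap<>();
        tags.put("comedy", bits(0, 1, 2));
        tags.put("drama", bits(3, 4));
        BitSet baseHits = bits(0, 1, 2, 3);      // documents matching the original query
        BitSet otherFacet = bits(1, 2, 3, 5);    // documents matching another selected facet
        System.out.println(count(baseHits, Arrays.asList(otherFacet), tags));
    }
}
```

This still means one BitSet per facet value, so the performance caveat about running multiple queries applies.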

git_index_get_bypath vs git_status_foreach_ext

It looks like git_index_get_bypath and git_status_foreach_ext (with GIT_STATUS_SHOW_INDEX_ONLY) are just different ways of reading the index. What are the differences, and why would I use one vs. the other?
git_index_get_bypath lets you look up a particular entry in a given index.
git_status_foreach_ext does a status check, which is a comparison between the worktree, the index and HEAD, and iterates over the results of that comparison, calling the passed function for each one. With GIT_STATUS_SHOW_INDEX_ONLY, it skips the worktree, so only the index and HEAD are compared.
Which one to use depends on what you're looking for: a particular entry in the index or a list of differences between the index and HEAD.

Lucene not giving results when specifying field

I have a database which I have indexed in Lucene (using PyLucene) by section (specified by markup in the document) using Lucene's fields. This index seems to work fine. I can search it using the default field, which is simply the entire document, and get reasonable results.
The problem is, when I search it using a specific section (not the default), I expect to get a certain number of results back (as specified by IndexSearcher.search(query, results)), but instead it might simply return nothing. So my question is: how can I get it to return a ranked list with the number of results I specify?
The only place I specify the field is in the QueryParser, by calling:
QueryParser(Version.LUCENE_CURRENT, field, StandardAnalyzer)
I would verify the index using Luke (which is something I do often when modifying my index strategy).

How to retrieve results by date range and sort using SOLR with ColdFusion 9.0.1?

I'm using ColdFusion 9.0.1 and the integrated SOLR full text search engine.
I have dates stored in my SQL Server database as datetime fields for upcoming events. I took these records and inserted them into a SOLR collection with the custom3 and custom4 fields being the dateStart and dateEnd dates respectively. Users want to query the collection against a date range and sort by closest date to now.
First question: How do we set the datatype for the custom1-4 fields? Or, can we? Based on this post, Optimizing Solr for Sorting, the field should be set to either tdate or date rather than string for best performance. Or does SOLR automatically make the field have the correct datatype based on this post, Sort by date in Solr/Lucene performance problems?
Second question: How would the search criteria be structured to pull records? How about between May 1, 2011 and July 31, 2011, for example?
I don't tell too many people this, but for you, I believe it's time to ditch CFINDEX/CFSEARCH, and start using Solr directly.
CF's implementation is built for indexing a large block of text with some attributes, not a query. If you start using Solr directly, you can create your own schema, and have far more granular control of how your search works. Yes, it's going to take longer to implement, but you will love the results. Filtering by date is just the beginning.
Here's a quick overview of the steps:
Create a new index using the CFAdmin. This is the easy way to create all the files you need.
Modify the schema. The schema is in [cfroot]/solr/multicore/[your index name]/conf/
The top half of the schema is <types>. This defines all the datatypes you could use. The bottom half is the <fields>, and this is where you're going to be making most of your changes. It's pretty straightforward, just like a table. Create a field for each "column" you want to include. "indexed" means that you want to make that field searchable. "stored" means that you want the exact data stored, so that you can use it to display results. Because I'm using CF9's ORM, I don't store much beyond the primary key, and I use loadEntityByPK() on my results page.
After modifying the schema, you need to restart the solr service/daemon.
Use http://cfsolrlib.riaforge.org/ to index your data (the add method is an 'insert or modify' style method), and to perform the search.
To do a search, check out this example. It shows how to sort and filter by date. I didn't test it, so the format of the dates might be wrong, but you'll get the idea. http://pastebin.com/eBBYkvCW
Sorry this answer is so general, I hope I can get you going down the right path here :)
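To give an idea of the query format once the dates live in real date fields: Solr expects dates in UTC ISO 8601 form (for example 2011-05-01T00:00:00Z), and range queries are written as field:[from TO to]. Here is a rough Java sketch that only builds the query and sort strings. The field name dateStart is an assumption about your custom schema, and the class is hypothetical, not part of cfsolrlib or Solr.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper that only builds strings; it does not talk to Solr.
final class SolrDateRange {

    // Midnight UTC at the start of the given day, in Solr's date format.
    static String startOfDay(LocalDate day) {
        return day.atStartOfDay().toInstant(ZoneOffset.UTC).toString();
    }

    // Last whole second of the given day in UTC, in Solr's date format.
    static String endOfDay(LocalDate day) {
        return day.atTime(23, 59, 59).toInstant(ZoneOffset.UTC).toString();
    }

    // Inclusive range query matching documents whose field falls on or between the two days.
    static String rangeQuery(String field, LocalDate from, LocalDate to) {
        return field + ":[" + startOfDay(from) + " TO " + endOfDay(to) + "]";
    }

    // Value for Solr's sort parameter: earliest date first.
    static String sortAscending(String field) {
        return field + " asc";
    }

    public static void main(String[] args) {
        // Assumed field name "dateStart"; use whatever your schema defines.
        System.out.println(rangeQuery("dateStart", LocalDate.of(2011, 5, 1), LocalDate.of(2011, 7, 31)));
        System.out.println(sortAscending("dateStart"));
    }
}
```

For May 1 to July 31, 2011 this produces dateStart:[2011-05-01T00:00:00Z TO 2011-07-31T23:59:59Z], which matches events that start inside that window. Since the events are all upcoming, sorting ascending on the start date puts the ones closest to now first.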

How to sort by Lucene.Net field and ignore common stop words such as 'a' and 'the'?

I've found how to sort query results by a given field in a Lucene.Net index instead of by score; all it takes is a field that is indexed but not tokenized. However, what I haven't been able to figure out is how to sort that field while ignoring stop words such as "a" and "the", so that the following book titles, for example, would sort in ascending order like so:
The Cat in the Hat
Horton Hears a Who
Is such a thing possible, and if yes, how?
I'm using Lucene.Net 2.3.1.2.
I wrap the results returned by Lucene into my own collection of custom objects. Then I can populate it with extra info/context information (and use things like the highlighter class to pull out a snippet of the matches), plus add paging. If you took a similar route you could create a "result" class/object, add something like a SortBy property and grab whatever field you wanted to sort by, strip out any stop words, then save it in this property. Now just sort the collection based on that property instead.
When you create your index, create a field that only contains the words you wish to sort on, then when retrieving, sort on that field but display the full title.
It's been a while since I used Lucene, but my guess would be to add an extra field for sorting and store the value in there with the stop words already stripped. You can probably use the same analyzers to generate this value.
There seems to be a catch-22 in that you must tokenize a field with an analyzer in order to strip punctuation and stop words, but you can't sort on tokenized fields. How then to strip the stop words without tokenizing?
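One way around that might be to strip the stop words yourself, in application code, before the value ever reaches the index, and then store the result in a separate untokenized sort field. A minimal sketch of building such a sort key follows. It is plain Java rather than Lucene.Net, the three-word stop list is only an example, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: builds the value for an untokenized sort field.
final class TitleSortKey {

    // Example stop-word list (an assumption); extend it to match your own needs.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    // Lowercases the title, drops punctuation and stop words,
    // and joins the remaining words with single spaces.
    static String sortKey(String title) {
        StringBuilder key = new StringBuilder();
        for (String word : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(word);
        }
        return key.toString();
    }

    // Sorts full titles by their stop-word-free keys, leaving the titles intact for display.
    static List<String> sortTitles(List<String> titles) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Comparator.comparing(TitleSortKey::sortKey));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortTitles(Arrays.asList("Horton Hears a Who", "The Cat in the Hat")));
    }
}
```

With the two titles from the question, "The Cat in the Hat" gets the key "cat in hat" and so sorts ahead of "Horton Hears a Who". In the index you would store the key in its own untokenized field, sort on that field, and display the original title.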
I found the "search lucene .net index with sort option" link useful for solving your problem.