Lucene numeric range search with LUKE - indexing

I have a number of numeric values indexed in a Lucene field:
60000
78500
105000
If I use LUKE to query for 78500 as follows:
price:78500
It returns the correct record; however, if I try to return all three records as a range, I get no results.
price:[60000 TO 105000]
I realise this is a padding issue, since numbers are treated as strings by Lucene; I just wish to know what I should put into LUKE to return the three records.
Many thanks for any help.

If the fields are indexed as NumericField, you must use the "Use XML Query Parser" option in the query parser tab and the 3.5 version of Luke:
https://code.google.com/p/luke/downloads/detail?name=lukeall-3.5.0.jar&can=2&q=
An example of a query with a string field and a numeric field is:
<BooleanQuery>
  <Clause fieldName="colour" occurs="must">
    <TermQuery>rojo</TermQuery>
  </Clause>
  <Clause fieldName="price" occurs="must">
    <NumericRangeQuery type="int" lowerTerm="4000" upperTerm="5000" />
  </Clause>
</BooleanQuery>
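For reference, the XML above corresponds roughly to the following programmatic query (a minimal Java sketch, assuming the Lucene 3.x API, an int NumericField for price, and inclusive bounds; the class name is just illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NumericRangeExample {
    public static Query buildQuery() {
        BooleanQuery query = new BooleanQuery();
        // colour:rojo, required
        query.add(new TermQuery(new Term("colour", "rojo")), BooleanClause.Occur.MUST);
        // price:[4000 TO 5000] as a true numeric (int) range, required
        query.add(NumericRangeQuery.newIntRange("price", 4000, 5000, true, true),
                BooleanClause.Occur.MUST);
        return query;
    }
}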

The solution I used was to add the price values to the index in zero-padded form. Then I could simply query using the new padded values, which works great. The new values in the index were:
060000
078500
105000
This solution was tied to an Examine search issue for Umbraco, so there is a thread on the forum with an end-to-end walkthrough of how to implement a numeric range search, if anyone requires it:
Umbraco Forum Thread
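For completeness, a minimal sketch of this padding approach at index time (Java, assuming the Lucene 3.x API; the six-digit width and the price field come from this answer, while the class and method names are just illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PaddedPriceDocument {
    // 60000 -> "060000", so lexicographic term order lines up with numeric order.
    static String pad(long price) {
        return String.format("%06d", price);
    }

    static Document build(long price) {
        Document doc = new Document();
        doc.add(new Field("price", pad(price), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}

With the padded terms in place, the Luke query from the question becomes price:[060000 TO 105000].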

Zero padding won't come into this particular query, since all the numbers you've shown have the same number of digits.
The range query you've shown has too many zeros on the second part of the range.
So the query for the data you've shown would be: price:[10500 TO 78500]
Hope this helps,

I assume these fields are indexed as NumericFields. The problem with them is that Lucene/Luke does not know how to parse numeric queries automatically. You need to override Lucene's QueryParser and provide your own logic for how these numbers should be interpreted.
As far as I know, Luke allows plugging in your custom parser; it just needs to be present in the CLASSPATH.
Have a look at this thread on Lucene mailing list:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201102.mbox/%3CAANLkTi=XUpyw09tcbjuTzNRpMJa730Cq-6_1agMAjYz6#mail.gmail.com%3E
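As a rough illustration, such an override might look like this (a minimal sketch against the Lucene 3.x Java API; the price field comes from the question, the class name is made up, and treating the values as ints is an assumption):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class NumericAwareQueryParser extends QueryParser {
    public NumericAwareQueryParser(Version version, String defaultField, Analyzer analyzer) {
        super(version, defaultField, analyzer);
    }

    @Override
    protected Query getRangeQuery(String field, String part1, String part2, boolean inclusive)
            throws ParseException {
        if ("price".equals(field)) {
            // Build a proper numeric range instead of a lexicographic term range.
            return NumericRangeQuery.newIntRange(field, Integer.valueOf(part1),
                    Integer.valueOf(part2), inclusive, inclusive);
        }
        return super.getRangeQuery(field, part1, part2, inclusive);
    }
}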

Related

Lucene.Net range subquery not returning expected results (RavenDB related)

I'm trying to write a lucene query to filter some data in RavenDB. Some of the documents in this specific collection are assigned a sequential number, and the valid ranges are not continuous (for example, one range can be from 100-200 and another from 1000 to 1400). I want to query RavenDB using Raven Studio (v2.5, the Silverlight client) to retrieve all documents that have values outside of these user-defined ranges.
This is the overly simplified document structure:
{
  "ExternalId": "something/1",
  "SequentialNumber": 12345
}
To test, I added 3500 documents, all of which have a SequentialNumber that's inside one of the following two ranges: 123-321 and 9000-18000, except for one that has 100000123. The ExternalId field is a reference to the parent document, and for this test all the documents have that field set to something/1. This is the Lucene query I came up with:
ExternalId: something/1 AND NOT
(SequentialNumber: [123 TO 321] OR SequentialNumber: [9000 TO 18000])
Running the query in RavenDB's Studio returns all the documents where SequentialNumber isn't in the 123-321 range. I would expect it to only return the document that has 100000123 as its SequentialNumber. I've been trying to Google for help, but so far I haven't found anything to steer me in the right direction.
What am I doing wrong?
RavenDB indexes numbers in two ways: once as strings (which is what you are seeing here) and once in numeric form.
For range queries, use:
SequentialNumber_Range: [Ix123 TO Ix321] OR SequentialNumber_Range: [Ix9000 TO Ix18000]
The Ix prefix means that you are using Int32.
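Putting that together with the original filter, the full query would presumably be:
ExternalId: something/1 AND NOT (SequentialNumber_Range: [Ix123 TO Ix321] OR SequentialNumber_Range: [Ix9000 TO Ix18000])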

Basic Lucene beginner's question: Index and Autocomplete

I am using Lucene.NET and have a basic question:
Do I need to make an additional index for autocompletion?
I created an index based on two different tables from a database.
Here are two Docs:
stored,indexed,tokenized,termVector<URL:/Service/Zahlungsmethoden/Teilzahlung>
stored,indexed,tokenized,termVector<Website:Body:The Text of the first Page>
stored,indexed,tokenized,termVector<Website:ID:19>
stored,indexed,tokenized,termVector<Website:Title:Teilzahlung>
stored,indexed,tokenized,termVector<URL:/Service/Kundenservice/Kinderbetreeung>
stored,indexed,tokenized,termVector<Website:Body:The text of the second Page>
stored,indexed,tokenized,termVector<Website:ID:13>
stored,indexed,tokenized,termVector<Website:Title:Kinderbetreuung>
I need to create a dropdown for a search with suggestions:
eg: term "Pag" should suggest "Page"
so I assume that for every word (token) in every doc, I need a list like:
p
pa
pag
page
Is this correct?
Where do I store these? In an additional index?
Or how would I rearrange the existing structure of my index to hold the autocompletion suggestions?
Thank you!
1) Like femtoRgon said above, look at the Lucene Suggest API.
2) That being said, one cheap way to perform auto-suggest is to look for all words that start with the string you've typed so far, like 'pa' returning 'pa' + 'pag' + 'page'. A wildcard query would return those results -- in Lucene query syntax, a query like 'pa*'. (You might want to restrict the suggestions to only those strings of length 2+.)
Mark Leighton Fisher has the right approach for a cheap way, but performing a wildcard query just returns you the documents, not the words. It may be better to look at the implementation of WildcardQuery. You need to use the Terms object retrieved from the IndexReader and iterate through the terms in the index.
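As a rough sketch of that term-iteration idea (shown against the Lucene 3.x Java API, which Lucene.NET 3.x mirrors almost method for method; the class and field names are placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class PrefixSuggester {
    // Collect indexed terms in "field" that start with the typed prefix, e.g. "pag" -> "page".
    static List<String> suggest(IndexReader reader, String field, String prefix) throws IOException {
        List<String> suggestions = new ArrayList<String>();
        TermEnum terms = reader.terms(new Term(field, prefix));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field) || !t.text().startsWith(prefix)) {
                    break;
                }
                suggestions.add(t.text());
            } while (terms.next());
        } finally {
            terms.close();
        }
        return suggestions;
    }
}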

Guidance on creating a basic search function in Rails3

Still pretty new to Rails and hoping to develop a function on a site enabling a search to be performed in the manner detailed below:
User inputs a search term / phrase (string of words but unlikely to be more than 5 or 6)
String is chopped into its constituent words
Entries in a single model with a matching description (a single field in the model) are output
Having looked at previous questions on this site, I am aware that there are a number of add-ons commonly used for search queries; however, are these needed in such a simple situation?
I was thinking that I could use an SQL command with a number of ANDs to perform this task.
Currently the model is stored in sqlite3, but it is probably going to grow to about 100,000 rows (just 10 fields, though) in the near future; is this likely to cause problems?
Finally, is there an easy way to pull out the words of a string automatically, for any length of string up to a certain limit that is unlikely to be exceeded?
Thanks in advance for your time and patience
You can easily pull the words from a string with Ruby: 'alice bob charlie'.split(/\s+/) will give you an array of the words.
Then, you can string those words together into an SQL query to find the appropriate records. I don't know about the performance of this solution, though... You should definitely test it out to see if there are any performance issues.
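For illustration, the assembled SQL for a three-word search might look something like this (the table and column names are made up; in Rails you would bind the values rather than interpolate them, to avoid SQL injection):

SELECT * FROM products
WHERE description LIKE '%alice%'
  AND description LIKE '%bob%'
  AND description LIKE '%charlie%';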

Contains() function falters with strings of numbers?

For some background information, I'm creating an application that searches against a couple of indexed tables to retrieve some records. It isn't overly complex to the point of, say, Google, but it's good enough for the purpose it serves, barring this strange issue.
I'm using the Contains() function, and it's going very well, except when the search contains strings of numbers. Now, I'm only passing in a string -- no numerical datatypes are being passed in, only characters. We're searching against a collection of emails, each appended with a custom ID when shot off from a workflow. So while testing, we decided to search via number strings.
In our test, we isolated the number 0042600006, which belongs to one and only one email subject. However, when using our query we get results for 0042600001, 0042600002, etc. The query is as follows (with some generic columns standing in):
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006')
We've tried every possible combination: '0042600006*', '"0042600006"' and '"0042600006*"'.
I think it's just a limitation of the function, but I thought this would probably be the best place for answers. Thanks in advance.
I asked this same question recently. Please see the insightful answer someone left me here.
Essentially, what this user says to do is to turn off the noise words (Microsoft has included the integers 0-9 as noise words in Full Text Search). I hope you can use this awesome tool with integers, as I now am!
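If you are on SQL Server 2008 or later, where the old noise-word files were replaced by stoplists, the switch is presumably something along these lines (table name taken from the query above; double-check the exact syntax for your version):

-- Stop applying the system stoplist, which treats single digits as noise words
ALTER FULLTEXT INDEX ON tableA SET STOPLIST OFF;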
Try adding LANGUAGE 1033 as an additional parameter; that worked in my solution.
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006', LANGUAGE 1033)
Try using:
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '%0042600006%')

Lucene search and underscores

When I use Luke to search my Lucene index using a standard analyzer, I can see that the field I am searching contains values of the form MY_VALUE.
When I search for field:"MY_VALUE", however, the query is parsed as field:"my value".
Is there a simple way to escape the underscore (_) character so that it will search for it?
EDIT:
4/1/2010 11:08AM PST
I think there is a bug in the tokenizer for Lucene 2.9.1 and it was probably there before.
Load up Luke and try to search for "BB_HHH_FFFF5_SSSS"; when there is a number, the following tokens are returned:
"bb hhh_ffff5_ssss"
After some testing, I've found that this is because of the number. If I input
"BB_HHH_FFFF_SSSS", I get
"bb hhh ffff ssss"
At this point, I'm leaning towards a tokenizer bug, unless the presence of the number is supposed to cause this behavior, but I fail to see why.
Can anyone confirm this?
It doesn't look like you used the StandardAnalyzer to index that field. In Luke you'll need to select the analyzer that you used to index that field in order to match MY_VALUE correctly.
Incidentally, you might be able to match MY_VALUE by using the KeywordAnalyzer.
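A minimal sketch of wiring that up at index time (Java, assuming Lucene 2.9/3.x; the field name comes from the question and the rest is illustrative):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerSetup {
    // Keep StandardAnalyzer as the default, but index "field" as a single
    // untokenized term so MY_VALUE survives intact, underscores and all.
    static PerFieldAnalyzerWrapper build() {
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
        wrapper.addAnalyzer("field", new KeywordAnalyzer());
        return wrapper;
    }
}

Remember to pick the same (keyword) analyzer in Luke when you re-run the query.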
I don't think you'll be able to use the standard analyser for this use case.
Judging by what I think your requirements are, the keyword analyser should work fine for little effort (the whole field becomes a single term).
I think some of the confusion arises when looking at the field with Luke. The stored value is not what's used by queries; what you need are the terms. I suspect that when you look at the terms stored for your field, they'll be "my" and "value".
Hope this helps,