Alphanumeric range query - Lucene

Is there an effective way to handle alphanumeric ranges in lucene?
Example ranges:
1 to 1 (includes 1A, 1B .. 1Z)
10A12 to 10A22 (includes 10A12, 10A13 .. 10A22)
1 to 10 (includes 1A, 1B .., 2A, 2B .., 9Z, 10) [does not include 10A]
I have two approaches:
Expand each range and index all possible values. I expect the number of unique values won't be huge.
Index the low and high values, then use a range query. I'm not sure how effective a range query is on alphanumeric values.
Need expert advice on this, please.

I hope you agree that your rules are very much custom and not really suitable for a generic framework such as Lucene. For example, why would the range [1..1] include the lettered values 1A..1Z, while [1..10] stops short of 10A?
I don't know if it is possible with your data set, but if you could come up with rules converting each element (including the ones containing letters) into a unique, order-preserving number using some formula of your own, you could apply that formula both when indexing and when querying. This would even allow range matching.
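For illustration, here is a minimal sketch of that idea in Java, assuming keys shaped like "1", "1A", or "10A12" (the method name, the 4-digit padding, and the key shape are my assumptions, not part of the question; the answer talks about a unique number, and here the "number" is realized as a fixed-width sortable string, which serves the same purpose for range matching):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Normalize "10A12" to a fixed-width, sortable form such as "0010A0012",
// so that plain lexicographic order matches the intended alphanumeric
// order. The result can be indexed (e.g. as a StringField) and searched
// with a TermRangeQuery; range endpoints need the same treatment.
static String sortableKey(String key) {
    Matcher m = Pattern.compile("\\d+|[A-Z]").matcher(key.toUpperCase());
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
        String part = m.group();
        if (Character.isDigit(part.charAt(0))) {
            sb.append(String.format("%04d", Integer.parseInt(part))); // pad numbers
        } else {
            sb.append(part); // keep letters as-is
        }
    }
    return sb.toString();
}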

Related

Lucene query language and numeric range

I'm applying the following Lucene query predicate in order to get all numbers in the inclusive range 2 to 6:
value:[2 TO 6]
and receive the documents with the following values:
567986400000
567986400000
567986400000
536450400000
536450400000
599608800000
536450400000
567986400000
I'm interested in a numeric range query, and obviously the Long value 567986400000 is not in the range [2 TO 6]. It looks like the range search compares terms as strings (lexicographically, "567986400000" does fall between "2" and "6"), and I don't know how to work around that in my application for the various numeric values.
How to properly use numeric range queries in Lucene?
To achieve a proper numeric range query you need to use the dedicated point field types from Lucene. See the Field javadoc:
IntPoint: int indexed for exact/range queries.
LongPoint: long indexed for exact/range queries.
FloatPoint: float indexed for exact/range queries.
DoublePoint: double indexed for exact/range queries.
So you need to make sure the field you run this query against is one of these types. You said you use a Lucene index generated by Neo4j; there has to be an option to create this kind of field, otherwise you won't be able to execute proper range queries.
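A minimal sketch of what that looks like in plain Lucene (the field name "value" and the example number are mine; whether Neo4j lets you create such a field is the open question):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.Query;

// Index the number as a LongPoint rather than as a text term.
Document doc = new Document();
doc.add(new LongPoint("value", 4L));

// Search [2 TO 6] as a true numeric range, inclusive on both ends.
Query query = LongPoint.newRangeQuery("value", 2L, 6L);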

Solr - Storing offsets and positions of numeric values

I would just like to know whether it is possible to store the offsets, positions, and frequencies of numeric values (int, float, double) in Solr. For terms we have character and token attributes on which offsets can be set, but for numeric values stored as Trie or Sortable fields, is it possible to set offsets or similar attributes?
I have tried looking into payloads and payload filters, but I am not able to work out which one would be best for this, nor whether it is possible to perform range queries on payload values.
Otherwise, there is also the option of setting IndexOptions on the field: DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS. But again, I am not sure whether this is applicable to anything other than terms/characters. Another candidate, NumericTermAttribute, does not have setters for position or offset.
It would be fine to store the numeric values as terms and perform a sortable string search, but why would I do that when I have Trie fields, which are more efficient performance-wise?
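For reference, requesting those index options on a Lucene field looks roughly like this (a sketch; whether any of it takes effect for numeric/Trie fields is exactly the open question above):

import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

// Ask for offsets and positions to be recorded in the postings.
FieldType ft = new FieldType();
ft.setTokenized(true);
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
ft.freeze();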

Simple question about indexes in SQLite

I have an SQLite database with 3 columns:
id, word, bitmask
I make a bitmask out of the vowels in the word, so I can quickly find every word that contains a certain vowel:
SELECT word FROM words WHERE bitmask & 7 = 0
I have two questions.
Should I add an index? If so, on which column, and how do I write the query?
I tried the code below, but didn't see any improvements in performance.
CREATE INDEX bitmask_index ON words (bitmask);
The "bitmask" column contains values from 1-256. Would it be a good thing to sort the "bitmask" column by value? In that case, how do I write the query for this?
Indexing is unlikely to help, because you are applying a function to the value before the search. Typically, this kills the effects of indexing.
Sorting bitmasks rarely makes sense, because bit positions in a bitmask do not correspond to anything that is ordered across rows; their meaning is tied to something inside the same row (e.g. the vowels of that row's word).
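One thing that may be worth trying (a sketch, assuming SQLite 3.9 or later, which supports indexes on expressions) is to index the exact expression the query uses; SQLite will only consider such an index when the WHERE clause contains the identical expression:

CREATE INDEX vowel_mask_idx ON words (bitmask & 7);

-- The planner can now satisfy this from the expression index:
SELECT word FROM words WHERE bitmask & 7 = 0;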

Fastest way to find exact match of long text string in large SQL db

My db table will have a column "a" of type TEXT holding long strings, like multiple paragraphs. Given an input string, I want to find the one matching record. If the table has millions of rows, what would be faster? A simple
WHERE a = ?
Or should I calculate and store an MD5 hash of each row and match on that? Suggestions welcome.
If you want an exact match, it will be much quicker to store the hash and compare against that. It precludes substring searches, but comparing a short fixed-length hash is much quicker than checking thousands of characters.
There will be some overhead in calculating the hash of your search parameter, but that is nothing compared to a string comparison against that much data.
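A sketch of the hash approach on SQL Server (the table name docs, the @input parameter, and the column names are placeholders; note that HASHBYTES only accepts inputs over 8000 bytes from SQL Server 2016 onward):

-- Persisted computed column holding the MD5 of the text, plus an index.
ALTER TABLE docs ADD a_hash AS HASHBYTES('MD5', a) PERSISTED;
CREATE INDEX ix_docs_a_hash ON docs (a_hash);

-- Seek on the 16-byte hash first, then confirm the full match.
SELECT * FROM docs
WHERE a_hash = HASHBYTES('MD5', @input) AND a = @input;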
If you are using SQL Server, you could look at the Full-Text Search feature:
http://msdn.microsoft.com/en-us/library/ms142571.aspx

Fastest way to find string by substring in SQL?

I have a huge table with 2 columns: Id and Title. Id is a bigint and I'm free to choose the type of the Title column: varchar, char, text, whatever. The Title column contains random text strings like "abcdefg", "q", "allyourbasebelongtous", with a maximum of 255 chars.
My task is to find strings by a given substring. Substrings have random length too, and can sit at the start, middle, or end of a string. The most obvious way to do it:
SELECT * FROM t WHERE Title LIKE '%abc%'
I don't care about INSERTs; I only need fast SELECTs. What can I do to make the search as fast as possible?
I use MS SQL Server 2008 R2; full-text search will be useless, as far as I can see.
If you don't care about storage, you can create another table with partial Title entries, one beginning at each character of the original title (up to 255 entries per title).
That way you can index these substrings and match only against the beginning of a string, which should greatly improve performance.
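A sketch of that suffix table (all names invented here):

CREATE TABLE TitleSuffix (
    Id     BIGINT       NOT NULL,  -- points back to t.Id
    Suffix VARCHAR(255) NOT NULL   -- one trailing substring of the title
);
CREATE INDEX ix_TitleSuffix_Suffix ON TitleSuffix (Suffix);

-- '%abc%' against t becomes an index-friendly prefix match here:
SELECT DISTINCT t.*
FROM t
JOIN TitleSuffix s ON s.Id = t.Id
WHERE s.Suffix LIKE 'abc%';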
If you want to use less space than Randy's answer and there is considerable repetition in your data, you can build an N-ary tree data structure where each edge is the next character, and hang each string and each of its trailing substrings in your data on it.
Number the nodes in depth-first order. Then you can create a table with up to 255 rows for each of your records, holding the Id of your record and the node id in the tree that matches the string or trailing substring. When you search, you find the node id that represents the string you are searching for (and, through its subtree, every string containing it) and do a range search.
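A sketch of the lookup side of that scheme (the schema is invented for illustration): because the nodes are numbered depth-first, every descendant of a node falls in a contiguous id range, so one BETWEEN covers the whole subtree:

CREATE TABLE TrieNode (
    NodeId       INT PRIMARY KEY,  -- assigned in depth-first order
    SubtreeMaxId INT NOT NULL      -- largest NodeId in this node's subtree
);
CREATE TABLE RecordNode (
    RecordId BIGINT NOT NULL,      -- the record's Id
    NodeId   INT    NOT NULL       -- node matching one trailing substring
);
CREATE INDEX ix_RecordNode_NodeId ON RecordNode (NodeId);

-- @nodeId and @maxId come from walking the trie to the search string:
SELECT DISTINCT RecordId
FROM RecordNode
WHERE NodeId BETWEEN @nodeId AND @maxId;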
Sounds like you've ruled out all good alternatives.
You already know that your query
SELECT * FROM t WHERE TITLE LIKE '%abc%'
won't use an index; it will do a full table scan every time.
If you were sure that the string was at the beginning of the field, you could do
SELECT * FROM t WHERE TITLE LIKE 'abc%'
which would use an index on Title.
Are you sure full text search wouldn't help you here?
Depending on your business requirements, I've sometimes used the following logic:
Do a "begins with" query (LIKE 'abc%') first, which will use an index.
Depending on whether any rows are returned (or how many), conditionally move on to the "harder" search that does the full scan (LIKE '%abc%').
It depends on what you need, of course, but I've used this in situations where I can show the easiest and most common results first, and only move on to the more expensive query when necessary.
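Sketched out (parameter and table names assumed):

-- Phase 1: index-friendly prefix search.
SELECT * FROM t WHERE Title LIKE @keyword + '%';

-- Phase 2: only if phase 1 returned nothing (or too little),
-- fall back to the full-scan search.
SELECT * FROM t WHERE Title LIKE '%' + @keyword + '%';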
You can add a calculated column to the table: titleLength AS LEN(title) PERSISTED, storing the length of the "title" column, and create an index on it.
Also add another calculated column: ReverseTitle AS REVERSE(title) PERSISTED.
Now when someone searches for a keyword, check whether the keyword's length equals titleLength; if so, do an "=" search. If the keyword is shorter than titleLength, do a LIKE: first title LIKE 'abc%', then ReverseTitle LIKE 'cba%'. This is similar to Brad's approach, i.e. you run the next, more difficult query only when required.
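In SQL Server terms, the setup would look something like this (the computed column names are from the answer above; the index names and @keyword are placeholders):

ALTER TABLE t ADD titleLength  AS LEN(title)     PERSISTED;
ALTER TABLE t ADD ReverseTitle AS REVERSE(title) PERSISTED;
CREATE INDEX ix_t_titleLength  ON t (titleLength, title);
CREATE INDEX ix_t_ReverseTitle ON t (ReverseTitle);

-- Keyword as long as the title: plain equality via the length index.
SELECT * FROM t WHERE titleLength = LEN(@keyword) AND title = @keyword;

-- Otherwise: prefix search first, reversed-prefix search if needed.
SELECT * FROM t WHERE title LIKE @keyword + '%';
SELECT * FROM t WHERE ReverseTitle LIKE REVERSE(@keyword) + '%';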
Also, if the 80-20 rule applies to your keywords/substrings (i.e. most searches hit a minority of the keywords), you can consider some sort of caching. For example, say you find that many users search for the keyword "abc" and this search returns records with ids 20, 22, 24, 25 - you can store that in a separate table and index it.
Then, when someone searches for a keyword, first check this "cache" table to see whether the search was already performed by an earlier user. If so, there is no need to hit the main table; simply return the results from the "cache" table.
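A sketch of that cache table (names invented):

CREATE TABLE KeywordCache (
    Keyword VARCHAR(255) NOT NULL,
    Id      BIGINT       NOT NULL  -- matching record in the main table
);
CREATE INDEX ix_KeywordCache_Keyword ON KeywordCache (Keyword);

-- Check the cache before touching the main table:
SELECT Id FROM KeywordCache WHERE Keyword = @keyword;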
You can also combine the above with SQL Server full-text search (assuming you have a valid reason not to rely on it alone). You could use full-text search first to shortlist the result set, then run a SQL query against your table to get the exact results, using the Ids returned by the full-text search as a parameter along with your keyword.
All this obviously assumes you have to use SQL. If not, you can explore something like Apache Solr.
You can create an indexed view, a newer SQL Server feature: define a view over the table, create an index on the column you need to search, and then run your searches against that view; this should return results faster.
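For reference, the mechanics of an indexed view look like this (a sketch with invented names; note it requires SCHEMABINDING and a unique clustered index first, and an index on Title still only helps prefix searches, not LIKE '%abc%'):

CREATE VIEW dbo.TitleView WITH SCHEMABINDING AS
    SELECT Id, Title FROM dbo.t;
GO
CREATE UNIQUE CLUSTERED INDEX ix_TitleView ON dbo.TitleView (Id);
CREATE INDEX ix_TitleView_Title ON dbo.TitleView (Title);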
Use an ASCII (single-byte) character set and a clustered index on the char column. The character set influences search performance because of the data size in both RAM and on disk; the bottleneck is often I/O. Since your column is at most 255 characters, you can use a normal index on the char field rather than full-text search, which is faster. Do not select unnecessary columns in your SELECT statement. Lastly, add more RAM to the server and increase the cache size.
Use a primary key on the relevant column and index it in clustered form.
Then search with whatever method you like (wildcard, =, or anything else); it will search more optimally because the table is already stored in clustered, i.e. sorted, form, so the engine knows where to look.