Sitecore: Lucene index item ID is stored without curly braces

I have the following configuration for storing a field.
<fieldType fieldName="Profile Id" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.Guid" nullValue="NULL" emptyString="EMPTY" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />
When I check the index with LukeAll I see the id as:
I don't know why the curly braces are gone and why all the characters are in lowercase. I want to store it as a normal GUID, as in Sitecore, with curly braces and all characters in uppercase.
I also tried type="System.string", but the result is still the same.

Because your field is TOKENIZED, Sitecore stores your ID this way to avoid a different problem. TOKENIZED means your ID is broken down internally in Lucene like this:
c50e5028
8eba
4ba9
854cf
(you get the picture)
So if you search Lucene for 8eba, it will match your profile_id field as you see it now, which is very rarely what one would expect.
To avoid this issue, don't put a Sitecore ID in the index, and not a Guid either. (There are other workarounds, but I'm showing you the simpler approach here.)
Use item.ID.ToShortID(). This produces the ID without curly braces and without dashes. When you later want to compare (or query), convert the value you are matching with the same .ToShortID() method.
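To make the difference concrete, here is a minimal Python sketch (not Sitecore or Lucene code). It assumes a simplified tokenizer that lowercases and splits on non-alphanumeric characters, and it assumes the short form is 32 uppercase hex digits. The names tokenize and short_id and the sample ID are made up for illustration.

```python
import re
import uuid


def tokenize(value):
    """Rough stand-in for a tokenizing analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", value.lower()) if token]


def short_id(value):
    """Hex digits only, uppercase, no braces and no dashes (assumed short-ID shape)."""
    return uuid.UUID(value).hex.upper()


# Hypothetical item ID, used only for this example.
item_id = "{C50E5028-8EBA-4BA9-854C-0123456789AB}"

print(tokenize(item_id))            # several small tokens, one per dash-separated group
print(short_id(item_id))            # one unbroken string
print(tokenize(short_id(item_id)))  # a single token, so partial matches on "8eba" go away
```

With the dashes gone the tokenizer has nothing to split on, so the whole ID is indexed as one term.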

To me it looks like your original value does not contain curly braces.
If the field value contains curly braces (and storageType="YES"), Luke will show you the stored value as opposed to the indexed data (which may be very different, depending on the analyzer used).
If you truly want the indexed data to contain curly braces, either set indexType="UN_TOKENIZED" or choose something like the Lucene.Net.Analysis.WhitespaceAnalyzer for the field.

Related

Match an exact phrase in Solr

I have indexed data in Solr. I have many phrases such as 'The Dark Knight' and 'The Dark Knight Rises'. When I query for 'The Dark Knight', I get both results. I want this query to match only 'The Dark Knight' and not 'The Dark Knight Rises'. Is this possible?
If you want exact matches you could use a String field in Solr. I suspect the field that you are searching on is a Text field.
Be aware that on String fields the search will only succeed if the values are identical (including case, punctuation, spaces).
You can use the Analysis Screen to see how matches will be made for these two different kind of field types.
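As a rough illustration of the difference (this is plain Python, not Solr), the sketch below models a text field as a lowercased token sequence matched as a phrase, and a string field as a verbatim comparison. The function names are made up, and real Solr analysis chains do more than this.

```python
def tokens(value):
    """Simplified text analysis: lowercase and split on whitespace."""
    return value.lower().split()


def text_field_matches(query, value):
    """Phrase match on tokens: the query tokens appear consecutively in the value."""
    q, v = tokens(query), tokens(value)
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))


def string_field_matches(query, value):
    """String field model: the value must be identical, including case and spaces."""
    return query == value


titles = ["The Dark Knight", "The Dark Knight Rises"]

print([t for t in titles if text_field_matches("The Dark Knight", t)])
print([t for t in titles if string_field_matches("The Dark Knight", t)])
```

The first list contains both titles, the second only the exact one. Note that the string model also rejects "the dark knight" because of the case difference.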
Yes, it is possible, as discussed by "Hector Correa". The following is one way to update the schema for exact search:
<field name="exact_field" type="string" indexed="true" stored="true"/>
I have used this for a dictionary implementation where the exact word is searched. The "string" field type is the solution.

Lucene Indexing to ignore apostrophes

I have a field that might have apostrophes in it.
I want to be able to:
1. store the value as is in the index
2. search based on the value ignoring any apostrophes.
I am thinking of using:
doc.add(new Field("name", value, Store.YES, Index.NO));
doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));
If I then do the same replace when searching, I guess it should work: the cleaned value is used to index and search, and the original value is used for display.
Am I missing any other considerations here?
Performing replaceAll directly on the value is bad practice in Lucene; it is much better to encapsulate your tokenization recipe in an Analyzer. Also, I don't see the benefit of adding the field twice in your use case (see Document.add).
If you want to store the original value and still be able to search without the apostrophes, simply declare your field like this:
doc.add(new Field("name", value, Store.YES, Index.ANALYZED));
Then hook up a custom Tokenizer that removes apostrophes (I think Lucene's StandardAnalyzer already includes this transformation).
If you are storing the field with the aim of using highlighting you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.
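The point of putting the apostrophe handling in the analysis step is that the same normalization then runs at index time and at query time, while the stored value stays untouched. Here is a small Python sketch of that idea; it is a toy in-memory index, not Lucene, and normalize, add and search are hypothetical names.

```python
import re

# Apostrophe-like characters to ignore, taken from the question.
APOSTROPHES = re.compile("['‘’`]")


def normalize(text):
    """The single 'analysis' step shared by indexing and searching."""
    return APOSTROPHES.sub("", text).lower()


def add(index, value):
    """Index the normalized tokens, but keep the original value for display."""
    for token in normalize(value).split():
        index.setdefault(token, []).append(value)


def search(index, term):
    """Normalize the query term the same way before looking it up."""
    return index.get(normalize(term), [])


index = {}
add(index, "Andy O'Brien")
add(index, "Jim OBrien")

print(search(index, "o’brien"))  # both names, returned in their original spelling
```

Because callers never do the replace themselves, the index side and the query side cannot drift apart.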

Is it possible to ignore characters in a string when matching with a regular expression

I'd like to create a regular expression such that when I compare a string against an array of strings, matches are returned with the regex ignoring certain characters.
Here's one example. Consider the following array of names:
{
"Andy O'Brien",
"Bob O'Brian",
"Jim OBrien",
"Larry Oberlin"
}
If a user enters "ob", I'd like the app to apply a regex predicate to the array so that all of the names in the above array match (i.e. the ' is ignored).
I know I can run the match twice, first against each name and second against each name with the ignored chars stripped from the string. I'd rather this be done by a single regex so I don't need two passes.
Is this possible? This is for an iOS app and I'm using NSPredicate.
EDIT: clarification on use
From the initial answers I realized I wasn't clear. The example above is a specific one. I need a general solution where the array of names is a large array with diverse names and the string I am matching against is entered by the user. So I can't hard code the regex like [o]'?[b].
Also, I know how to do case-insensitive searches so don't need the answer to focus on that. Just need a solution to ignore the chars I don't want to match against.
Since you have discarded all the answers showing the ways it can be done, you are left with the answer:
NO, this cannot be done. Regex does not have an option to 'ignore' characters. Your only options are to modify the regex to match them, or to do a pass on your source text to get rid of the characters you want to ignore and then match against that. (Of course, then you may have the problem of correlating your 'cleaned' text with the actual source text.)
If I understand correctly, you want a way to match the characters "ob" 1) regardless of capitalization, and 2) regardless of whether there is an apostrophe in between them. That should be easy enough.
1) Use a case-insensitivity modifier, or use a regexp that specifies that the capital and lowercase version of the letter are both acceptable: [Oo][Bb]
2) Use the ? quantifier to indicate that a character may be present zero or one times. o'?b will match both "o'b" and "ob". If you want to allow other characters that may or may not be present, group them with the apostrophe in a character class. For example, o['~-]?b will match "ob", "o'b", "o-b", and "o~b". (Keep the hyphen at the end of the class; written as ['-~] it would define a range from ' to ~ and match almost any printable character.)
So the complete answer would be [Oo]'?[Bb].
Update: The OP asked for a solution that would cause the given character to be ignored in an arbitrary search string. You can do this by inserting '? after every character of the search string. For example, if you were given the search string oleary, you'd transform it into o'?l'?e'?a'?r'?y'?. Foolproof, though probably not optimal for performance. Note that this would match "o'leary" but also "o'lea'r'y'" if that's a concern.
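The transformation in the update can be sketched in a few lines. This is Python rather than NSPredicate, and ignoring_pattern is a made-up helper name; the only additions beyond the answer are escaping the user's input so regex metacharacters in it are taken literally, and allowing a set of ignored characters instead of just the apostrophe.

```python
import re


def ignoring_pattern(search, ignored="'"):
    """Append an optional 'ignored character' class after every search character."""
    optional = "[" + re.escape(ignored) + "]?"
    return "".join(re.escape(ch) + optional for ch in search)


names = ["Andy O'Brien", "Bob O'Brian", "Jim OBrien", "Larry Oberlin"]


def matching(search):
    """Case-insensitive search of the sample names with apostrophes ignored."""
    pattern = ignoring_pattern(search)
    return [name for name in names if re.search(pattern, name, re.IGNORECASE)]


print(ignoring_pattern("ob"))
print(matching("ob"))
print(matching("obrie"))
```

The generated pattern string should be usable with any regex engine that supports character classes and the ? quantifier.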
In this particular case, just throw the set of characters into the middle of the regex as optional. This works specifically because you have only two characters in your match string, otherwise the regex might get a bit verbose. For example, match case-insensitive against:
o[']*b
You can add more characters to that character class in the middle to ignore them. Note that the * matches any number of the ignored characters (so O'''Brien will match). To allow at most a single instance, change it to ?:
o[']?b
You can make particular characters optional with a question mark, which means that it will match whether they're there or not, e.g:
/o\'?b/
This would match all of the above. Add .+ to either side to match all other characters, and a space to denote the start of the surname:
/.+? o\'?b.+/
And use the case-insensitivity modifier to make it match regardless of capitalisation.

NSPredicate, whitespaces in CoreData. How to trim in predicate?

I have a CoreData/SQLite application in which I have "Parent Categories" and "Categories". I do not have control over the data, and some of the "Parent Categories" values have trailing white spaces.
The lookup works with CONTAINS, but that is something I cannot use. For example, I have 2 entries, MEN and MENS. If I use CONTAINS, both records are returned, and you can see how this would be an issue.
I can easily trim on my side, but the predicate will compare that with the database and will not match. So my question is: how can I account for whitespace in the predicate, if that is possible at all?
I have a category "MENS" which someone has selected in the application, and it is compared against "MENS " in the database.
I would trim the data prior to doing the lookup. You can do this easily using stringByTrimmingCharactersInSet. By doing it beforehand, you'll also avoid any performance hit, which could be expensive if you're doing a character-based comparison with CONTAINS.
So, let's say your search string is "MEN".
Here's the way to strip out any dodgy characters:
NSString *trimmed = [@"MEN " stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
There's also whitespaceAndNewlineCharacterSet, which does what it says on the tin.
Alternatively, it's easy to create your own custom character set of the characters you want to trim.
For that, have a look at:
NSCharacterSet Class Reference
and
Apple's String Programming Guide
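For comparison, here is the same idea as a tiny Python sketch rather than Objective-C. It assumes the comparison can be done in memory with both sides trimmed, which is not the same thing as filtering inside an NSPredicate; same_category is a hypothetical helper.

```python
def same_category(selected, stored, trim_chars=None):
    """Compare two category names after trimming both ends.

    trim_chars=None trims all whitespace; pass a string such as " \t"
    to trim only a custom set of characters.
    """
    return selected.strip(trim_chars) == stored.strip(trim_chars)


print(same_category("MENS", "MENS "))  # trailing space in the stored value is ignored
print(same_category("MEN", "MENS "))   # still an exact comparison, unlike CONTAINS
```

Trimming keeps the comparison exact, so "MEN" does not pick up "MENS" the way CONTAINS would.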

Extract terms from query for highlighting

I'm extracting terms from the query by calling ExtractTerms() on the Query object that I get as the result of QueryParser.Parse(). I get a Hashtable, but each item appears as:
Key - term:term
Value - term:term
Why are the key and the value the same? And why is the term value duplicated and separated by a colon?
Do highlighters only insert tags, or do they do anything else? I want not only to get text fragments but to highlight the whole source text (it's fairly big). I am trying to get the terms and insert tags by hand using their offsets, but I am not sure this is the right solution.
I think the answer to this question may help.
It is because .NET 2.0 doesn't have an equivalent to Java's HashSet. The conversion to .NET uses Hashtables with the same object as both key and value. The colon you see is just the result of Term.ToString(): a Term is a field name plus the term text, and your field name is probably "term".
To highlight an entire document using the Highlighter contrib, use the NullFragmenter.