Strange behavior of Lucene SpanishAnalyzer class with accented words


I'm using the SpanishAnalyzer class in Lucene 3.4. When I parse accented words, I get a strange result. If I parse, for example, the two words "comunicación" and "comunicacion", the stems I get are "comun" and "comunicacion". If I instead parse "maratón" and "maraton", I get the same stem for both words ("maraton").
So, at least in my opinion, it's very strange that the same word, "comunicación", gives different results depending on whether it is accented or not. If I search for the word "comunicacion", I should get the same result regardless of whether it's accented.
This is the code I'm using:
SpanishAnalyzer sa = new SpanishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", sa);
String str = "comunicación";
String str2 = "comunicacion";
System.out.println("first: " + parser.parse(str)); //stem = comun
System.out.println("second: " + parser.parse(str2)); //stem = comunicacion
The only solution I've found for retrieving every word that shares the stem of "comunicacion", accented or not, is to strip the accents in a first step and then parse the text with the Analyzer, but I don't know if that's the right way.
Please, can anyone help me?

Did you check which tokenizer and token filters SpanishAnalyzer uses? There is a filter called ASCIIFoldingFilter. Try placing it before the stemming filter; it will remove the accents before the stemmer sees the token.
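As a rough illustration of the "strip the accents first" idea from the question, here is a minimal sketch that uses only the JDK's java.text.Normalizer. It is not Lucene's ASCIIFoldingFilter, and the class and method names (AccentFolder, fold) are made up for this example. Note that it also turns "ñ" into "n", which may or may not be what you want for Spanish text.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: strips diacritics with the JDK only.
class AccentFolder {

    // Decompose each character into base letter + combining marks (NFD),
    // then delete the combining marks (Unicode category M).
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        System.out.println(fold("comunicaci\u00f3n")); // prints comunicacion
        System.out.println(fold("marat\u00f3n"));      // prints maraton
    }
}
```

The folded string would then be handed to the analyzer or query parser, and the same folding would have to be applied at index time for the two sides to agree.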

Related

Regex matching sequence of characters

I have a test string such as: The Sun and the Moon together, forever
I want to be able to type a few characters or words and be able to match this string if the characters appear in the correct sequence together, even if there are missing words. For example, the following search word(s) should all match against this string:
The Moon
Sun tog
Tsmoon
The get ever
What regex pattern should I be using for this? I should add that the supplied test strings are going to be dynamic within an app, and so I'd like to be able to use a pattern based on the search string.

Your example Tsmoon shows that you want to match partial words (T), ignore case (s, m), and allow anything between each entered character. So as a first attempt you can:
Set the ignore case option
Between each character of the input, insert the regular expression that matches zero or more of anything. You can choose whether to match the shortest or longest run.
Try that, reading the documentation for NSRegularExpression if you're stuck, and see how it goes. If you get stuck ask a new question showing your code and the RE constructed and explain what happens/doesn't work as expected.
HTH
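The two steps above can be sketched as follows. This is an illustration in Java's java.util.regex rather than NSRegularExpression, and the names SequencePattern, build and matches are invented for the example. It assumes whitespace in the search string should be ignored and that the shortest run (`.*?`) is wanted between characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: builds "T.*?s.*?m.*?o.*?o.*?n" from "Tsmoon".
class SequencePattern {

    static Pattern build(String search) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < search.length(); ) {
            int cp = search.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // assumption: spaces in the search text are not significant
            }
            // Quote each character so that '.', '(' and friends are matched literally.
            parts.add(Pattern.quote(new String(Character.toChars(cp))));
        }
        // Lazy "anything" between characters; ignore case; let '.' span newlines.
        return Pattern.compile(String.join(".*?", parts),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.DOTALL);
    }

    // find() is used so the sequence may start anywhere in the text.
    // An empty search string produces an empty pattern, which matches everything.
    static boolean matches(String search, String text) {
        return build(search).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "The Sun and the Moon together, forever";
        System.out.println(matches("Tsmoon", text));
        System.out.println(matches("The get ever", text));
    }
}
```

Quoting each character matters because the search string comes from the user and may contain regex metacharacters.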

Java - Index a String (Substring)

I have this string:
201057&channelTitle=null_JS
I want to be able to cut out the '201057' and make it a new variable. But I don't always know how long the digits will be, so can I somehow use the '&' as a reference?
myDigits.substring(0, position of &)?
Thanks

Sure, you can split the string along the &.
String s = "201057&channelTitle=null_JS";
String[] parts = s.split("&");
String newVar = parts[0];
The expected result here is
parts[0] = "201057";
parts[1] = "channelTitle=null_JS";
In production code you should of course check the length of the parts array before reading parts[1], in case no "&" was present.
Several programming languages also support the useful inverse operation, join:
String s2 = String.join("&", parts); // has the same value as s
String.join is part of the Java standard library from Java 8 onward; on older versions, Apache Commons Lang (StringUtils.join), for example, provides it.
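A small sketch of the length check and of the split/join round trip, assuming Java 8 or later for String.join. The class and method names are invented for illustration.

```java
// Hypothetical demo class for split with a length check and for join.
class SplitJoinDemo {

    // Returns the text after the first '&', or null when the input has no '&'.
    // The limit of 2 splits at the first '&' only.
    static String afterFirstAmpersand(String s) {
        String[] parts = s.split("&", 2);
        return parts.length == 2 ? parts[1] : null;
    }

    // Splits and joins again. The limit of -1 keeps trailing empty strings,
    // so the result equals the input even when it ends with '&'.
    static String roundTrip(String s) {
        return String.join("&", s.split("&", -1));
    }

    public static void main(String[] args) {
        System.out.println(afterFirstAmpersand("201057&channelTitle=null_JS"));
        System.out.println(roundTrip("201057&channelTitle=null_JS"));
    }
}
```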

Always read the API first. There is an indexOf method in String that will return you the first index of the character/String you gave it.
You can use myDigits.substring(0, myDigits.indexOf('&'));
However, if you want to get all of the arguments in the query separately, then you should use mvw's answer.
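One thing to watch with the indexOf approach: indexOf returns -1 when the character is absent, and substring(0, -1) throws StringIndexOutOfBoundsException. A minimal guarded version might look like this (the names are made up, and returning the whole string when there is no '&' is just one possible choice):

```java
// Hypothetical helper: everything before the first '&', or the whole string if none.
class DigitsPrefix {

    static String beforeAmpersand(String s) {
        int pos = s.indexOf('&');
        // indexOf returns -1 when '&' does not occur; guard before calling substring.
        return pos >= 0 ? s.substring(0, pos) : s;
    }

    public static void main(String[] args) {
        System.out.println(beforeAmpersand("201057&channelTitle=null_JS")); // prints 201057
    }
}
```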

User input text translation

I'm working on a translator that will take English language text (as user input into a UITextView) and (with a button press) replace specific words with alternatives. I have both the English words in scope plus their alternatives in separate Arrays (englishArray and alternativeArray), indexed correspondingly.
My challenge is finding an algorithm that will allow me to identify a word in the input text (a UITextView) ignoring characters like <",.()>, lookup the word in englishArray (case insensitive), locate the corresponding word in alternativeArray and then use that word in place of the original - writing it back to the UITextView.
Any help greatly appreciated.
NB. I have created a Category extending the NSArray functionality with a indexOfCaseInsensitiveString method that ignores case when doing an indexOfObject type lookup if that helps.
Tony.

I think that using an NSScanner would be best to parse the string into separate words which you could then pass to your indexOfCaseInsensitiveString method. scanCharactersFromSet:intoString: using a set of all the characters you want to ignore, including whitespace and newline characters should get you to the start of a word, and then you could use scanUpToCharactersFromSet:intoString: using the same set to scan to the end of the word. Using scanLocation at the beginning and end of each scan should allow you to get the range of that word, so if you find a match in your array, you will know where in your string to make the replacement.

Thanks for your suggestion. It's working with one exception.
I want to capture all punctuation so I can recreate the original input but with the substituted words. Even though I have a 'space' in my Character Set, the scanner is not putting the spaces into the 'intoString'. Other characters I specify in the Character Set such as '(' and ';' are represented in the 'intoString'.
Net is that when I recreate the input, it's perfect except that I get individual words running into each other.
UPDATE: I fixed that issue by including:
[theScanner setCharactersToBeSkipped:nil];
Thanks again.
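For readers who want to see the overall algorithm independent of NSScanner, here is a sketch in Java. It walks the input word by word, copies everything between words (spaces and punctuation) through unchanged, and substitutes any word found in the lookup table. The class name WordSubstituter is invented; the two parallel arrays mirror the englishArray and alternativeArray from the question. A "word" is assumed to be a run of letters and digits, so a contraction like "don't" is treated as two words.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the replace-words-but-keep-punctuation algorithm.
class WordSubstituter {

    // Assumption: a word is a run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    private final Map<String, String> lookup = new HashMap<>();

    WordSubstituter(String[] englishArray, String[] alternativeArray) {
        if (englishArray.length != alternativeArray.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        for (int i = 0; i < englishArray.length; i++) {
            // Lower-case the keys so the lookup is case-insensitive.
            lookup.put(englishArray[i].toLowerCase(Locale.ROOT), alternativeArray[i]);
        }
    }

    String translate(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(input);
        int last = 0;
        while (m.find()) {
            // Copy the separators (spaces, punctuation) before this word untouched.
            out.append(input, last, m.start());
            String word = m.group();
            String alternative = lookup.get(word.toLowerCase(Locale.ROOT));
            out.append(alternative != null ? alternative : word);
            last = m.end();
        }
        out.append(input, last, input.length()); // trailing separators
        return out.toString();
    }

    public static void main(String[] args) {
        WordSubstituter t = new WordSubstituter(
                new String[] {"hello", "world"}, new String[] {"hola", "mundo"});
        System.out.println(t.translate("Hello, (world)! Bye.")); // prints hola, (mundo)! Bye.
    }
}
```

Because separators are copied rather than skipped, the running-together problem described above does not arise in this formulation.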

Remove & character from string objective c

How would I go about removing the "&" symbol from a string. It's making my xml parser fail.
I have tried
[currentParsedCharacterData setString: [currentParsedCharacterData stringByReplacingOccurrencesOfString:@"&" withString:@"and"]];
But it seems to have no effect

Really what this boils down to is you want to gracefully handle invalid XML. The XML Parser is properly telling you that this XML is invalid, and is thusly failing to parse. Assuming you have no control over this XML content, I would suggest pre-parsing it for common errors like this, the output of which would be a sanitized XML doc that has a better chance of success.
To sanitize the doc, a simple search and replace may be enough. The problem with a blanket replace on every & is that there are valid uses of &, for example &amp; or &copy;. You would end up munging the XML by creating something like this: andcopy;
You could search for "ampersand space", but that won't catch a string that has an ampersand as its last character (an edge case that might be easily handled). What you are really searching for are occurrences of & that are not followed by a ;, or where whitespace is encountered before the following ;, because a semicolon is fine on its own.
If you need more power because you need to detect this, and other errors, I would suggest going to NSScanner or RegEx matching to search for occurrences of this and other common errors during your sanitization step. It is also very common for XML files to be rather large things, so you need to be careful when dealing with these as in-memory strings as this can easily lead to application crashes. Breaking it up into manageable chunks is something NSScanner can do very well.
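The "ampersand not followed by a reference" rule can be expressed with a negative lookahead. The sketch below is in Java for illustration (the class name AmpersandEscaper is made up) and rests on a few assumptions: anything shaped like a named, decimal or hexadecimal reference is left alone, every other & becomes &amp;, and the input has no CDATA sections or comments that should be left untouched. Note that a plain XML parser only predefines five named entities, so something like &copy; may still be rejected unless a DTD declares it.

```java
import java.util.regex.Pattern;

// Hypothetical pre-parse sanitizer for bare ampersands.
class AmpersandEscaper {

    // Match '&' unless it starts a name (&amp;), a decimal (&#169;)
    // or a hexadecimal (&#xA9;) reference terminated by ';'.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escape(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("Tom & Jerry"));   // prints Tom &amp; Jerry
        System.out.println(escape("&copy; 2011"));   // prints &copy; 2011
    }
}
```

This also covers the two cases mentioned above: an ampersand at the very end of the string, and an ampersand where whitespace appears before the next semicolon.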

For a quick attempt look at stringByReplacingOccurrencesOfString on NSString
NSString *str = @"a & b";
str = [str stringByReplacingOccurrencesOfString:@"&" withString:@"and"]; // better to replace with &amp;
However, you should also deal with other special characters, e.g. < and >.

Using MultiFieldQueryParser

I am using MultiFieldQueryParser to parse strings like a.a., b.b., etc.
But after parsing, the dots in the strings are removed.
What am I missing here?
Thanks.

I'm not sure the MultiFieldQueryParser does what you think it does. Also...I'm not sure I know what you're trying to do.
I do know that with any query parser, strings like 'a.a.' and 'b.b.' will have the periods stripped out because, at least with the default Analyzer, all punctuation is treated as white space.
As far as the MultiFieldQueryParser goes, that's just a QueryParser that allows you to specify multiple default fields to search. So with the query
title:"Of Mice and Men" "John Steinbeck"
The string "John Steinbeck" will be looked for in all of your default fields whereas "Of Mice and Men" will only be searched for in the title field.

What analyzer is your parser using? If it's StopAnalyzer then the dot could be a stop word and is thus ignored. Same thing if it's StandardAnalyzer which cleans up input (includes removing dots).

(Repeating my answer from the dupe. One of these should be deleted).
The StandardAnalyzer specifically handles acronyms, and converts C.F.A. (for example) to cfa. This means you should be able to do the search, as long as you make sure you use the same analyzer for the indexing and for the query parsing.
I would suggest you run some more basic test cases to eliminate other factors. Try to use an ordinary QueryParser instead of a multi-field one.
Here's some code I wrote to play with the StandardAnalyzer:
StringReader testReader = new StringReader("C.F.A. C.F.A word");
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", testReader);
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
The output for this, by the way was:
(cfa,0,6,type=<ACRONYM>)
(c.f.a,7,12,type=<HOST>)
(word,13,17,type=<ALPHANUM>)
Note, for example, that if the acronym doesn't end with a dot then the analyzer assumes it's an internet host name, so searching for "C.F.A" will not match "C.F.A." in the text.
