How to reduce a string to 7-bit ASCII characters for indexing purposes?

I am working on an application which must index certain sentences. It currently uses Java and PostgreSQL. The sentences may be in several languages, such as French and Spanish, which use accents and other non-ASCII symbols.
For each word I want to create an indexable equivalent so that a user can perform an accent-insensitive search (transliteration). For example, when the user searches for "nacion", the search must find it even if the original word stored by the application was "Nación".
What would be the best strategy for this? I am not necessarily restricted to PostgreSQL, nor does the internal indexed value need to bear any similarity to the original word. Ideally, it should be a generic solution for converting any Unicode string into an ASCII string that is insensitive to case and accents.
So far I am using a custom function shown below which naively just replaces some letters with ASCII equivalents before storing the indexed value and does the same on query strings.
public String toIndexableASCII (String sStrIn) {
if (sStrIn==null) return null;
int iLen = sStrIn.length();
if (iLen==0) return sStrIn;
StringBuilder sStrBuff = new StringBuilder(iLen);
String sStr = sStrIn.toUpperCase();
for (int c=0; c<iLen; c++) {
switch (sStr.charAt(c)) {
case 'Á':
case 'À':
case 'Ä':
case 'Â':
case 'Å':
case 'Ã':
sStrBuff.append('A');
break;
case 'É':
case 'È':
case 'Ë':
case 'Ê':
sStrBuff.append('E');
break;
case 'Í':
case 'Ì':
case 'Ï':
case 'Î':
sStrBuff.append('I');
break;
case 'Ó':
case 'Ò':
case 'Ö':
case 'Ô':
case 'Ø':
sStrBuff.append('O');
break;
case 'Ú':
case 'Ù':
case 'Ü':
case 'Û':
sStrBuff.append('U');
break;
case 'Æ':
sStrBuff.append('E');
break;
case 'Ñ':
sStrBuff.append('N');
break;
case 'Ç':
sStrBuff.append('C');
break;
case 'ß':
sStrBuff.append('B');
break;
case (char)255:
sStrBuff.append('_');
break;
default:
sStrBuff.append(sStr.charAt(c));
}
}
return sStrBuff.toString();
}

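// Requires: import java.text.Normalizer;
String s = "Nación";
// NFD splits each accented letter into a base letter plus combining marks.
String x = Normalizer.normalize(s, Normalizer.Form.NFD);
StringBuilder sb = new StringBuilder(s.length());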
for (char c : x.toCharArray()) {
if (Character.getType(c) != Character.NON_SPACING_MARK) {
sb.append(c);
}
}
System.out.println(s); // Nación
System.out.println(sb.toString()); // Nacion
How this works:
It decomposes accented characters into their NFD form (ó becomes o◌́), then strips the combining diacritical marks.
Character.NON_SPACING_MARK is Java's constant for the Unicode general category Mn (Nonspacing Mark), which contains the combining diacritical marks.
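As a reusable variant of the snippet above, here is a minimal sketch that also folds case, since the question asks for a case-insensitive key. The names `AccentFolder` and `toIndexable` are made up for illustration. It uses the regex class `\p{M}`, which removes all combining marks (Mn, Mc and Me), slightly more than the Mn-only check above.

```java
import java.text.Normalizer;
import java.util.Locale;

final class AccentFolder {
    // Returns a lower-cased copy of the input with combining marks removed.
    static String toIndexable(String input) {
        if (input == null) {
            return null;
        }
        // Step 1: canonical decomposition (base letter + combining marks).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: drop every character in the Unicode "Mark" categories.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Step 3: locale-independent case folding for the index key.
        return stripped.toLowerCase(Locale.ROOT);
    }
}
```

With this sketch, `AccentFolder.toIndexable("Nación")` yields `nacion`. Letters that have no canonical decomposition, such as Ø, Æ and ß, pass through unchanged, so a strictly ASCII result would still need an explicit mapping for them.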

The one obvious improvement for your current code: use a Map<Character, Character> that you prefill with your mappings.
Then simply check whether that Map has a mapping for the character; if so, use it; otherwise use the original character.
And as Androbin explains, there are specialized maps that do not rely on objects but work with primitive types, such as those in the Trove library. So, depending on your solution and requirements, you could look into that.
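A minimal sketch of the Map-based idea, assuming Java 8 or later. The names `MapFolder`, `REPLACEMENTS` and `fold` are hypothetical, and the table is only a partial copy of the switch in the question. The values are Strings rather than Characters, which is a deliberate departure so that one letter can expand to two (the "AE" and "SS" entries are choices made here, not the question's mappings). Non-ASCII letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Map;

final class MapFolder {
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();

    // Registers the same replacement for every character in 'sources'.
    private static void map(String sources, String replacement) {
        for (char c : sources.toCharArray()) {
            REPLACEMENTS.put(c, replacement);
        }
    }

    static {
        map("\u00C1\u00C0\u00C4\u00C2\u00C5\u00C3", "A"); // accented capital A forms
        map("\u00C9\u00C8\u00CB\u00CA", "E");             // accented capital E forms
        map("\u00CD\u00CC\u00CF\u00CE", "I");             // accented capital I forms
        map("\u00D3\u00D2\u00D6\u00D4\u00D8", "O");       // accented capital O forms, O with stroke
        map("\u00DA\u00D9\u00DC\u00DB", "U");             // accented capital U forms
        map("\u00D1", "N");                               // N with tilde
        map("\u00C7", "C");                               // C with cedilla
        map("\u00C6", "AE");                              // AE ligature
        map("\u00DF", "SS");                              // sharp s
    }

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            // Per-character upper-casing never changes the string length.
            char c = Character.toUpperCase(input.charAt(i));
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Upper-casing one character at a time avoids a pitfall of `String.toUpperCase()`, which turns ß into "SS" and therefore changes the length of the string being iterated.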

JavaScript ECMA script derived grammar error

I'm creating a simplified grammar derived from the ECMA Script grammar:
https://github.com/antlr/grammars-v4/blob/master/ecmascript/ECMAScript.g4
The original grammar works fine so far and I added a few rules only. However the following expression is a valid expression in my grammar but throws an error in the original ECMA grammar:
round(frames++/((getTime()-start)/1000))
caused by the
frames++/
expression.
Whereas the following similar expressions work:
round(frames++*((getTime()-start)/1000))
round(frames++%((getTime()-start)/1000))
My question is how can I make the first expression work and what is the difference?
That is probably because the / is interpreted as the start of a regex literal.
If you look at the lexer method that decides whether a / starts a regex literal or is a division operator:
/**
* Returns {@code true} iff the lexer can match a regex literal.
*
* @return {@code true} iff the lexer can match a regex literal.
*/
private boolean isRegexPossible() {
if (this.lastToken == null) {
// No token has been produced yet: at the start of the input,
// no division is possible, so a regex literal _is_ possible.
return true;
}
switch (this.lastToken.getType()) {
case Identifier:
case NullLiteral:
case BooleanLiteral:
case This:
case CloseBracket:
case CloseParen:
case OctalIntegerLiteral:
case DecimalLiteral:
case HexIntegerLiteral:
case StringLiteral:
// After any of the tokens above, no regex literal can follow.
return false;
default:
// In all other cases, a regex literal _is_ possible.
return true;
}
}
then this should also return false for the tokens -- and ++.
Adjust the isRegexPossible() as follows:
private boolean isRegexPossible() {
if (this.lastToken == null) {
// No token has been produced yet: at the start of the input,
// no division is possible, so a regex literal _is_ possible.
return true;
}
switch (this.lastToken.getType()) {
case Identifier:
case NullLiteral:
case BooleanLiteral:
case This:
case CloseBracket:
case CloseParen:
case OctalIntegerLiteral:
case DecimalLiteral:
case HexIntegerLiteral:
case StringLiteral:
case PlusPlus: // <-- NEW
case MinusMinus: // <-- NEW
// After any of the tokens above, no regex literal can follow.
return false;
default:
// In all other cases, a regex literal _is_ possible.
return true;
}
}

How to compare strings instead of integers in switch case?

I have this code, which tries to compare strings in a switch statement:
char input[50+1];
fgets( input, 50, stdin );
switch (input) {
case "register": NSLog(@"Voce escolheu a opcao de cadastro");
break;
case "enter": NSLog(@"Voce escolheu a opcao de entrada");
break;
case "exit": NSLog(@"Voce escolheu a opcao de saida");
break;
}
This code gives me an error, I believe because a string cannot be used after the 'case' keyword. Could someone help me solve this problem? I believe there are other ways to write a switch on strings, but how?
The lookup option works pretty well. Consider:
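// Convert the C buffer to an NSString and strip the newline that fgets keeps.
NSString *key = [[NSString stringWithUTF8String:input]
stringByTrimmingCharactersInSet:[NSCharacterSet newlineCharacterSet]];
NSArray *strings = @[@"string1", @"string2"];
NSUInteger index = [strings indexOfObject:key];
switch (index) {
case 0:
// stuff for string 1
break;
case 1:
// stuff for string 2
break;
case NSNotFound:
// not found
break;
}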
You can't. Switch only works with integers. The best options are a chain of if-else statements or a lookup table (e.g. an NSDictionary).

Need regular expression that will work to find numeric and alpha characters in a string

Here's what I'm trying to do. A user can type in a search string, which can include '*' or '?' wildcard characters. I'm finding this works with regular strings but not with ones including numeric characters.
e.g:
414D512052524D2E535441524B2E4E45298B8751202AE908
1208
If I look for a section of that hex string, it returns false. If I look for "120" or "208" in the "1208" string, it also fails.
Right now, my regular expression pattern ends up looking like this when a user enters, say "w?f": '\bw.?f\b'
I'm (obviously) not well-versed in regular expressions at the moment, but would appreciate any pointers someone may have to handle numeric characters in the way I need to - thanks!
Code in question:
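/**
* Tests whether the wildcard search string is found in the given text.
*
* @param searchString the user's search string; may contain '*' and '?' wildcards
* @param strToBeSearched the text to search in
* @return true if the converted pattern is found in the text
*/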
public boolean findString(String searchString, String strToBeSearched) {
Pattern pattern = Pattern.compile(wildcardToRegex(searchString));
return pattern.matcher(strToBeSearched).find();
}
private String wildcardToRegex(String wildcard){
StringBuffer s = new StringBuffer(wildcard.length());
s.append("\\b");
for (int i = 0, is = wildcard.length(); i < is; i++) {
char c = wildcard.charAt(i);
switch(c) {
case '*':
s.append(".*");
break;
case '?':
s.append(".?");
break;
default:
s.append(c);
break;
}
}
s.append("\\b");
return(s.toString());
}
Let's assume your string to search in is
1208
The search "term" the user enters is
120
The pattern then is
\b120\b
The \b (word boundary) meta-character matches at the beginning and end of "words".
In our example, this can't match: inside 1208 there is no word boundary after 120, because the following 8 is also a word character.
For a substring search, the pattern has to be
\b.*120.*\b
where .* matches any number of characters (including none).
Solution:
either add the .*s to your wildcardToRegex(...) method to make this functionality work out-of-the-box,
or tell your users to search for *120*, because your * wildcard character does exactly the same.
This is, in fact, my preference because the user can then define whether to search for entries starting with something (search for something*), including something (*something*), ending with something (*something), or exactly something (something).
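A minimal sketch of that second option, in which the user's own wildcards decide between prefix, suffix, substring and exact search. It is a variation, not the original method: the \b anchors are replaced by a whole-string match, and literal text is passed through `Pattern.quote` so that characters such as '.' or '(' in the search string are not treated as regex syntax (the original code does not escape them). The class name `WildcardSearch` and the method `matches` are hypothetical; '?' keeps the original's `.?` meaning of zero or one character.

```java
import java.util.regex.Pattern;

final class WildcardSearch {
    // Converts '*' to ".*" and '?' to ".?"; everything else is quoted literally.
    static String wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                flushLiteral(literal, regex);
                regex.append(c == '*' ? ".*" : ".?");
            } else {
                literal.append(c);
            }
        }
        flushLiteral(literal, regex);
        return regex.toString();
    }

    // Appends the pending literal run, quoted, and clears the buffer.
    private static void flushLiteral(StringBuilder literal, StringBuilder regex) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    // True only if the whole text matches the wildcard expression.
    static boolean matches(String wildcard, String text) {
        return Pattern.compile(wildcardToRegex(wildcard), Pattern.DOTALL)
                .matcher(text)
                .matches();
    }
}
```

Under these assumptions, `matches("*120*", "1208")` and `matches("120*", "1208")` succeed, while `matches("120", "1208")` fails because no wildcard allows the trailing 8.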

Char.IsSymbol("*") is false

I'm working on a password validation routine, and am surprised to find that VB does not consider '*' to be a symbol per the Char.IsSymbol() check.
Here is the output from the QuickWatch:
char.IsSymbol("*") False Boolean
The MS documentation does not specify what characters are matched by IsSymbol, but does imply that standard mathematical symbols are included here.
Does anyone have any good ideas for matching all standard US special characters?
Characters that are symbols in this context: UnicodeCategory.MathSymbol, UnicodeCategory.CurrencySymbol, UnicodeCategory.ModifierSymbol and UnicodeCategory.OtherSymbol from the System.Globalization namespace. These are the Unicode characters designated Sm, Sc, Sk and So, respectively. All other characters return False.
From the .Net source:
internal static bool CheckSymbol(UnicodeCategory uc)
{
switch (uc)
{
case UnicodeCategory.MathSymbol:
case UnicodeCategory.CurrencySymbol:
case UnicodeCategory.ModifierSymbol:
case UnicodeCategory.OtherSymbol:
return true;
default:
return false;
}
}
or converted to VB.Net:
Friend Shared Function CheckSymbol(uc As UnicodeCategory) As Boolean
Select Case uc
Case UnicodeCategory.MathSymbol, UnicodeCategory.CurrencySymbol, UnicodeCategory.ModifierSymbol, UnicodeCategory.OtherSymbol
Return True
Case Else
Return False
End Select
End Function
CheckSymbol is called by IsSymbol with the Unicode category of the given char.
Since the * is in the category OtherPunctuation (you can check this with char.GetUnicodeCategory()), it is not considered a symbol, and the method correctly returns False.
To answer your question: use char.GetUnicodeCategory() to check which category the character falls in, and decide to include it or not in your own logic.
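The category-based approach can be sketched as follows. This is written in Java rather than VB.NET, on the assumption that the idea carries over: Java's `Character.getType` reports the same Unicode general categories that `Char.GetUnicodeCategory` does. The class and method names are made up for illustration, and treating "symbol or punctuation" as "special" is one possible policy for a password check, not a rule from the documentation.

```java
final class SpecialChars {
    // Mirrors the four symbol categories: Sm, Sc, Sk, So.
    static boolean isSymbolCategory(char c) {
        switch (Character.getType(c)) {
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return true;
            default:
                return false;
        }
    }

    // The seven punctuation categories: Pc, Pd, Ps, Pe, Pi, Pf, Po.
    static boolean isPunctuationCategory(char c) {
        switch (Character.getType(c)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Assumed policy: a "special character" is any symbol or punctuation mark.
    static boolean isSpecial(char c) {
        return isSymbolCategory(c) || isPunctuationCategory(c);
    }
}
```

Here '*' is not in a symbol category but is still accepted by `isSpecial`, because it falls under other punctuation; '+' and '$' are accepted as math and currency symbols.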
If you simply need to know that a character is something other than a digit or letter,
just use
!char.IsLetterOrDigit(c)
preferably with
&& !char.IsControl(c)
Maybe you have the compiler option "strict" off, because with
Char.IsSymbol("*")
I get a compiler error
BC30512: Option Strict On disallows implicit conversions from 'String' to 'Char'.
To define a Character literal in VB.NET, you must add a c to the string, like this:
Char.IsSymbol("*"c)
IsPunctuation(x) is what you are looking for.
This worked for me in C#:
string Password = "";
ConsoleKeyInfo key;
do
{
key = Console.ReadKey(true);
// Ignore any key out of range.
if (char.IsPunctuation(key.KeyChar) || char.IsLetterOrDigit(key.KeyChar) || char.IsSymbol(key.KeyChar))
{
// Append the character to the password.
Password += key.KeyChar;
Console.Write("*");
}
// Exit if Enter key is pressed.
} while (key.Key != ConsoleKey.Enter);

How do I use a boolean operator in a case statement?

I just don't understand how to use a boolean operator inside a switch statement:
switch (expression) {
case > 20:
statements
break;
case < -20:
statements
break;
}
Edit:
I don't want an If () statement.
You can't. Use if() ... else ....
The nearest thing available to what you want uses a GCC extension and is thus non-standard. You can define ranges in case statements instead of just a value:
switch(foo)
{
case 0 ... 20: // matches when foo is between 0 and 20, inclusive
// do cool stuff
break;
}
However, you can't use that to match everything below a certain value; it has to be a precise range with both bounds given. A switch can only replace equality comparisons against constants, and can't be used for anything more than that.
switch ((expression) > 20) {
case true:
statements
break;
case false:
default:
statements
break;
}
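If you want more than one boolean test in the switch, you could do this (ii is 1 when the expression is greater than 20, 2 when it is less than -20, and 0 otherwise):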
int ii = ((expression) > 20) + 2 * ((expression) < -20);
switch (ii) {
case 1:
statements
break;
case 2:
statements
break;
}
This, IMO is pretty bad code, but it is what you asked for...
Just use the if statement, you'll be better off in the long run.