How do I identify which letter of the alphabet a word starts with in Objective-C? - objective-c

Given a string, I'm trying to determine which letter of the alphabet it belongs to. For example, "apple" goes into the "A" section. "Banana" goes into the "B" section. I'm using this to identify the section:
NSRange range = [string rangeOfString:letter
options:NSAnchoredSearch |
NSCaseInsensitiveSearch |
NSDiacriticInsensitiveSearch |
NSWidthInsensitiveSearch
range:NSMakeRange(0, string.length)
locale:locale];
Where string is the string I'm trying to bucket and letter is a letter of the alphabet. I do this in a loop for each letter of the alphabet.
It works great, except for words like "æquo", which should be bucketed into the letter "A", but aren't. What to do?
Edit: The plot thickens. I'm looking at Korean now. The word "것" should be bucketed into the letter "ㄱ". There's got to be some way to do this other than maintaining a huge mapping table.

I think I've figured it out: I was thinking about it wrong. The question isn't, does a given word begin with a certain letter of the alphabet. Rather, the question is, does a given word fall within the sorting range of a certain letter of the alphabet.
For example, in the case of "æquo", I can check if it falls within the sorting range of the "A" section by checking if it is or comes after "A", and comes before "B".
Apple's compare:options:range:locale: method knows the answer to those two questions for any given locale. In this particular example, for French it would say yes. For some other language, like Danish, it should say no.
I've tested this on English, Spanish, Portuguese, French, German, and Korean, and it appears to be giving the expected results.
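The sorting-range idea above can be sketched in Python. This is only an illustration under stated assumptions: toy_sort_key is a hypothetical stand-in for a real locale-aware collation key (it expands a few ligatures by hand and strips diacritics and case), and section_for is a made-up helper name. In a real app the comparison would come from a locale-aware API such as the compare:options:range:locale: method mentioned above.

```python
import bisect
import unicodedata


def toy_sort_key(text):
    """Hypothetical stand-in for a locale-aware collation key (assumption)."""
    # Expand a few ligatures by hand; a real collator does this per locale.
    for ligature, expansion in (("æ", "ae"), ("Æ", "AE"), ("œ", "oe"), ("Œ", "OE")):
        text = text.replace(ligature, expansion)
    # Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def section_for(word, section_letters, sort_key=toy_sort_key):
    """Return the last section letter that sorts at or before the word.

    section_letters must already be in sorted order for the given sort_key.
    Returns None when the word sorts before the first section.
    """
    keys = [sort_key(letter) for letter in section_letters]
    # bisect_right counts the section letters that sort at or before the word.
    index = bisect.bisect_right(keys, sort_key(word))
    return section_letters[index - 1] if index > 0 else None


SECTIONS = [chr(code) for code in range(ord("A"), ord("Z") + 1)]

if __name__ == "__main__":
    for sample in ("apple", "Banana", "æquo"):
        print(sample, "->", section_for(sample, SECTIONS))
```

The point of the design is that the word is never compared against a single first character; it is only placed between two neighbouring section letters, so whatever the collation key decides about "æ" decides the bucket.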

Related

How does camel case work with lower case proper nouns?

As the title says: how would camel case (e.g. theQuickBrownFox) work when one of the words starts with a lowercase letter followed by a capital, as is the case in iPhone?
getiPhoneNumber() for instance looks weird.
Would it be getIphoneNumber() or getIPhoneNumber() or what?
What if it was the first word? iPhoneNumber vs. iphoneNumber? After all, only the start of each separate word should be capitalized.
Any algorithm that separates a camel case identifier back into individual words would correctly produce: get IPhone number from getIPhoneNumber and would be fooled by getiPhoneNumber because it would separate this into geti phone number. Therefore, the correct naming is getIPhoneNumber.
Using the very same criterion I would use iphoneNumber rather than iPhoneNumber.
EDIT
Given that there is some consensus about this question being opinion-based I would like to say that any criterion on how to capitalize a sentence using the camel case convention shouldn't be opinion based but consistent with whatever separation algorithm one would happen to use.
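To make the "separation algorithm" argument concrete, here is a minimal sketch of one possible splitter. The rule (start a new word at every capital letter) and the name split_camel_case are assumptions chosen for illustration; other splitters, for example ones that keep runs of capitals together, will give different results.

```python
import re


def split_camel_case(identifier):
    """Split an identifier into lowercase words, one word per capital letter.

    Assumed rule: a word is an optional capital followed by lowercase
    letters or digits, or a single capital standing on its own.
    """
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]", identifier)
    return [part.lower() for part in parts]


if __name__ == "__main__":
    for name in ("getiPhoneNumber", "getIphoneNumber", "getIPhoneNumber", "iphoneNumber"):
        print(name, "->", split_camel_case(name))
```

Under this particular rule getiPhoneNumber comes back as geti / phone / number and getIphoneNumber as get / iphone / number, while getIPhoneNumber comes back as get / i / phone / number. So which spelling round-trips cleanly really does depend on the splitter you commit to.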

What is the difference between an Alphabet and an element of a set?

What is the difference between an Alphabet and an element of a set?
Is an Alphabet an element of a set, or is it a set itself?
It might be a little more correct to say that an Alphabet is a domain, whose definition consists of: "a set." In other words, an Alphabet is the set of all possible letters, such that any symbol that is not within that set, is not "a letter."
Notice that "a word" is not "a set," but rather "a collection" of "letters," because any word (such as the word, "letters") might contain the same letter many times.
If we only talk about automata theory, an Alphabet is a set of elements (letters).
So an Alphabet is an example of a set whose elements are letters.
For example, take the alphabet A = {a, b, c}, where 'a' is one of its elements.
I don't know if this is the answer you want. Maybe you could clarify your question if it is not? :)
EDIT: But you can have a set of sets, like:
K = { {a, b, c}, {m, n, o} }, which contains two Alphabets. But here K isn't an Alphabet anymore; it's a set of Alphabets.
Bye.
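The distinctions made in the answers above (a letter is an element, an alphabet is a set, a word is a sequence that may repeat letters, and K is a set of alphabets) can be mirrored in a few lines of Python. This is just an illustrative sketch; the names ALPHABET, is_word_over and K are made up for the example.

```python
# An alphabet is a set: each symbol appears at most once, order is irrelevant.
ALPHABET = frozenset({"a", "b", "c"})


def is_word_over(word, alphabet):
    """A word is a sequence of symbols, each drawn from the alphabet."""
    return all(symbol in alphabet for symbol in word)


# A set of alphabets: its elements are sets, not letters.
K = {frozenset({"a", "b", "c"}), frozenset({"m", "n", "o"})}

if __name__ == "__main__":
    print("a" in ALPHABET)                 # a letter is an element of the alphabet
    print(is_word_over("abca", ALPHABET))  # a word may repeat letters
    print(ALPHABET in K)                   # an alphabet is an element of K
    print("a" in K)                        # a letter is not an element of K
```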

How to recognize if word has no meaning, maybe some impossible syllables?

Initially, I have m arrays of n characters, where each array contains one character, unknown to me, of the word I need (condition: the word has meaning).
For example, m = 4, n = 3: array0 = {'t', 'e', 'c'}, array1 = {'g', 'o', 'a'}, array2 = {'w', 'd', 'y'}, array3 = {'e', 'o', 's'}. Each array contains only one correct letter: array0 holds the first letter, array1 the second, and so on. So, the probable secret word is 'code': array0[2] = 'c', array1[1] = 'o', array2[1] = 'd', array3[0] = 'e'.
I need to find all of existing letter-combinations, i.e. exclude generated meaningless words.
Are there any rules/regularities of 'impossible' syllables/letter-combinations in English?
I'm attacking a Vigenere cipher, so I know the length of the key and its probable characters. I'm shuffling my arrays and getting many meaningless words; the problem is filtering them. As I understand it, some conditions can help recognize incorrect words. For example, if the word length is > 4, then a word made up entirely of vowels or entirely of consonants is wrong. Some letter combinations, such as kk, hh, ww, are in general impossible too. Where can I find such rules?
I'm assuming that by "word has meaning" you mean that it is an English dictionary word.
I believe that you should approach the problem from the other direction, as GregS suggests, and go through a dictionary. English has many exceptions when it comes to letters and spelling, and the number of words that look English is much greater than the actual number of English words. You won't be able to cut down your search very much in that way.
But because you know the length and probable characters you are able to quickly throw out many dictionary words. Also, if the message isn't too short, it would also be very fast to attempt a decoding of the message with possible words, and throw out unlikely decodings by letter, digram or trigram frequencies.
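A minimal sketch of that dictionary-first approach, assuming the word list is already loaded into memory: keep only the words of the right length whose i-th letter appears in the i-th candidate array. The function name matching_words and the tiny WORD_LIST are hypothetical; a real run would read a full dictionary file instead.

```python
def matching_words(dictionary_words, candidates):
    """Keep words whose i-th letter is among the i-th set of candidates."""
    candidate_sets = [set(letters) for letters in candidates]
    return [
        word
        for word in dictionary_words
        if len(word) == len(candidate_sets)
        and all(ch in allowed for ch, allowed in zip(word.lower(), candidate_sets))
    ]


# Candidate letters per key position, taken from the question's example.
CANDIDATES = [["t", "e", "c"], ["g", "o", "a"], ["w", "d", "y"], ["e", "o", "s"]]

# Hypothetical stand-in for a real dictionary (assumption for illustration).
WORD_LIST = ["code", "toys", "cage", "tree", "eggs", "cow"]

if __name__ == "__main__":
    print(matching_words(WORD_LIST, CANDIDATES))
```

This walks the dictionary once instead of generating all n^m letter combinations and then trying to decide which of them are words.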
I'm not sure I follow your strategy for attacking a Vigenere cipher. However, in response to:
I need to find all of existing letter-combinations, i.e. exclude generated meaningless words. Are there any rules/regularities of 'impossible' syllables/letter-combinations in English?
Yes, indeed there is a plethora of such rules. There are two ways of learning and implementing these rules:
1. Carefully study the morphology of English, and meticulously implement the rules.
2. Train a Markov model on a corpus of English text.
Option 1 will be substantially more work for little additional benefit.
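As a rough illustration of the Markov-model option mentioned above, here is a character-bigram sketch. Everything specific in it is an assumption: the function names, the add-one smoothing, the alphabet_size of 28 (26 letters plus start and end markers), and the idea of ranking candidates by average log-probability. A usable model would need a much larger training corpus than the toy list in the example.

```python
import math
from collections import Counter


def train_bigram_model(corpus_words):
    """Count character bigrams, with ^ and $ marking word start and end."""
    pair_counts = Counter()
    first_counts = Counter()
    for word in corpus_words:
        padded = "^" + word.lower() + "$"
        for first, second in zip(padded, padded[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
    return pair_counts, first_counts


def average_log_probability(word, model, alphabet_size=28):
    """Score a word; higher (closer to zero) means more plausible."""
    pair_counts, first_counts = model
    padded = "^" + word.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for first, second in pairs:
        # Add-one smoothing so unseen bigrams get a small, non-zero probability.
        probability = (pair_counts[(first, second)] + 1) / (first_counts[first] + alphabet_size)
        total += math.log(probability)
    return total / len(pairs)


if __name__ == "__main__":
    # Toy corpus, for illustration only (assumption).
    model = train_bigram_model(["code", "coda", "mode", "node", "cold", "core", "done"])
    for candidate in ("code", "kkhh"):
        print(candidate, average_log_probability(candidate, model))
```

Candidates would then be sorted by score and the lowest-scoring ones discarded, rather than rejected by hand-written rules.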

User input text translation

I'm working on a translator that will take English language text (as user input into a UITextView) and (with a button press) replace specific words with alternatives. I have both the English words in scope plus their alternatives in separate Arrays (englishArray and alternativeArray), indexed correspondingly.
My challenge is finding an algorithm that will allow me to identify a word in the input text (a UITextView) ignoring characters like <",.()>, look up the word in englishArray (case insensitive), locate the corresponding word in alternativeArray and then use that word in place of the original, writing it back to the UITextView.
Any help greatly appreciated.
NB. I have created a Category extending the NSArray functionality with an indexOfCaseInsensitiveString method that ignores case when doing an indexOfObject type lookup, if that helps.
Tony.
I think that using an NSScanner would be best to parse the string into separate words which you could then pass to your indexOfCaseInsensitiveString method. scanCharactersFromSet:intoString: using a set of all the characters you want to ignore, including whitespace and newline characters should get you to the start of a word, and then you could use scanUpToCharactersFromSet:intoString: using the same set to scan to the end of the word. Using scanLocation at the beginning and end of each scan should allow you to get the range of that word, so if you find a match in your array, you will know where in your string to make the replacement.
Thanks for your suggestion. It's working with one exception.
I want to capture all punctuation so I can recreate the original input but with the substituted words. Even though I have a 'space' in my Character Set, the scanner is not putting the spaces into the 'intoString'. Other characters I specify in the Character Set such as '(' and ';' are represented in the 'intoString'.
The net result is that when I recreate the input, it's perfect except that individual words run into each other.
UPDATE: I fixed that issue by including:
[theScanner setCharactersToBeSkipped:nil];
Thanks again.
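For comparison, the same word-by-word replacement discussed in this thread can be sketched in Python with a regular expression instead of a scanner. This is not the NSScanner code from the thread; the function name translate and the word pattern (ASCII letters and apostrophes only) are assumptions. Because only the matched words are rewritten, spaces and punctuation pass through unchanged.

```python
import re


def translate(text, english_words, alternative_words):
    """Replace each known English word with its alternative, ignoring case."""
    # Pair the two index-aligned lists into a case-insensitive lookup table.
    lookup = {
        english.lower(): alternative
        for english, alternative in zip(english_words, alternative_words)
    }

    def swap(match):
        word = match.group(0)
        # Unknown words are written back exactly as they were.
        return lookup.get(word.lower(), word)

    # Assumed definition of a word: a run of ASCII letters or apostrophes.
    return re.sub(r"[A-Za-z']+", swap, text)


if __name__ == "__main__":
    print(translate("Hello, (World); hello!", ["hello", "world"], ["howdy", "planet"]))
```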

How to generate (book) indexes?

I need to create an index for a book. While the task looks easy at first glance (group words by their first letter, then sort them), this obvious solution works only for English. The real world is, however, more complex. See http://en.wikipedia.org/wiki/Collation :
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
I tried to find an existing solution.
DocBook stylesheets do not address the problem.
The best match I found is xindy ( http://xindy.sourceforge.net/ ), but this tool is too tightly coupled to LaTeX.
Any other suggestions?
Naively, you could examine every word in the text and create a hash, using the words as a key, and building up an array of locations (page numbers?) as values.
But indexes are generally a bit more focused than that.
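The naive approach described above might look like the following sketch. The function name build_index, the rule that a word is any run of word characters, and the use of list positions as page numbers are all assumptions for illustration. It indexes every word, which is exactly why a real book index needs more selectivity than this.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the sorted page numbers it appears on.

    pages is a list of page texts; page numbers start at 1.
    """
    locations = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"\w+", text.lower()):
            locations[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in locations.items()}


if __name__ == "__main__":
    index = build_index(["Apple pie", "apple tart and pie"])
    for word in sorted(index):
        print(word, index[word])
```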
Well, after replying to the comments, I realized that I don't need a tool to generate indexes, but a library that can sort according to the rules of a given culture. First experiments show that I'm going to use ICU and its Python bindings, PyICU. For example:
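import icu  # PyICU, the Python bindings for ICU
words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]
# Collator that applies French sorting rules
collator = icu.Collator.createInstance(icu.Locale.getFrance())
# Python 3: sorted() takes key= rather than cmp=, so sort by ICU sort keys
for word in sorted(words, key=collator.getSortKey):
    print(word)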