Storing a word list in Objective-C - objective-c

I previously made an anagram solver: given a set of 9 letters, the program would find every possible 3-9 letter word that could be made from those letters.
I made this in JavaScript, where a word list of 100,000+ words was stored in a single array from which suitable answers could be found.
To find every subword of a 9-letter set, the program only needed to search through the whole array once, so no matter which 9 letters you gave the program, the list of subwords was always returned in under a second.
I am now making the same program in Objective-C as part of an iOS app I intend to make.
Would there be any issues with storing a 100,000+ word list in an NSArray in Objective-C?
Issues such as memory usage, lookup speed, etc.
Are there better ways of storing this word list that would make lookups faster or perhaps use less memory?
(I am a novice in Objective-C.)
Thank you for your time.

The simple answer is to try it and see. You can then use Instruments.app to measure the performance.
You may find "Alternative Objective-C object allocation for large arrays" a worthwhile read.
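For reference, the single-pass search described in the question could look roughly like the sketch below. It is written in Python rather than Objective-C, and `find_subwords` and the tiny word list are made up for illustration; the same idea should carry over to a loop over an NSArray of strings.

```python
from collections import Counter


def find_subwords(letters, word_list, min_len=3):
    """Return every word in word_list that can be spelled from the given letters."""
    available = Counter(letters.lower())
    matches = []
    for word in word_list:  # one pass over the whole list
        if not (min_len <= len(word) <= len(letters)):
            continue
        needed = Counter(word.lower())
        # The word fits if no letter is needed more often than it is available.
        if all(available[ch] >= count for ch, count in needed.items()):
            matches.append(word)
    return matches


# Hypothetical mini word list; a real list would hold 100,000+ entries.
words = ["cat", "act", "tack", "at", "stack", "zebra"]
print(find_subwords("tacksbcde", words))
```

Each word costs a small, bounded amount of work, so the total time grows linearly with the size of the list, which matches the "search the array once" behaviour described above.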

Related

How to store huge amount of NSStrings for comparison purpose

I am writing a (linguistic) morphology Mac application. I often have to check whether the words in a given text are in a huge list of words (~1,000,000).
My question is: how do I store these lists?
I use a .txt file to store the words and create an NSSet from this file, which lives as long as the application is running.
I use a database like SQLite.
Some points:
I think the focus should be on speed, because the analysis is triggered by the user and these comparisons make up the largest part of the computation.
The lists may change via updates.
I have used Core Data and MySQL before, so (I think) I could implement either.
I have read a lot about the pros and cons of database vs. file, but I never felt it covered my use case.
I don't know whether it matters which technique I use, because the size of these files is relatively small (~20 MB), and even with a lot of supported languages, only 3-4 of these files will be loaded into memory at the same time.
Thanks! Danke!
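The first option (a text file loaded into an in-memory set) can be sketched as follows. This is Python standing in for the NSSet approach; `load_word_set`, `unknown_words` and the three-word demo file are assumptions made for the example, not part of the question.

```python
import os
import tempfile


def load_word_set(path):
    """Read one word per line into a set (hash-based, so lookups are O(1) on average)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def unknown_words(text, word_set):
    """Return the words of text that are not in the set, in order of appearance."""
    return [word for word in text.lower().split() if word not in word_set]


# Demo with a throwaway file standing in for the real word list.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("haus\nbaum\nkatze\n")
demo_path = tmp.name
demo_set = load_word_set(demo_path)
os.remove(demo_path)

print(unknown_words("Haus Baum Hund", demo_set))
```

The set is built once at launch and every later check is a single hash lookup, which is why this layout tends to favour speed over memory.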

What would be an efficient way to see if a word exist when comparing to hundreds of thousands of words?

I've been working with Objective-C for a while and I want to attempt to build an app. The app will be similar to a Scrabble type of game. There are going to be drag-and-drop tiles (I already know how to program this) and a submit button. The only thing that I'm having trouble with is figuring out how I am going to compare the letters on the board to hundreds of thousands of words without bogging down my program too much. What I have in mind, so far, is to store the words in a database. Does Objective-C have any kind of built-in API that can access a standard dictionary database? I'm not referring to a dictionary array but rather an actual database with words and possibly definitions. Any thoughts on this?
I have used Lexicontext. It's $20, but it's worth it in my opinion: it is extremely fast, there's a demo, and it contains an API for formatting definitions with CSS.
Looks like you're happy with the $20 solution. For the more adventuresome, the data structure you'd want to learn about is tries.
Imagine a tree whose root has 26 children, one for each letter of the alphabet. Now imagine that each child has 26 children, too. You can spell any word of length N by taking N steps down from the root. Now imagine that you prune the tree so it contains only valid words. That's your (very fast) word validator: a lookup takes time proportional to the length of the word being checked, regardless of how many words the dictionary holds.
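A minimal sketch of such a trie, in Python for brevity. The class and method names are invented for this example, and it uses a dictionary of children instead of a fixed 26-slot array, so only the branches that belong to valid words ever exist (which is the "pruning" described above).

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode, created only when needed
        self.is_word = False  # True if the path from the root spells a valid word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Walk one node per letter; the cost depends only on len(word)."""
        node = self.root
        for letter in word:
            node = node.children.get(letter)
            if node is None:
                return False
        return node.is_word


trie = Trie(["cat", "car", "card"])
print(trie.contains("car"), trie.contains("ca"), trie.contains("cards"))
```

The `is_word` flag matters because valid words can end at inner nodes: "car" is a word even though the path continues to "card", while "ca" is only a prefix.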
Depending on how "Scrabble-like" your game actually is....
Are you going to be validating the word when the player presses submit?
That's not how Scrabble works. The player can play any (non)word so long as the opponents do not challenge the word.
So you'll need word validation in the "challenge" system, but it shouldn't happen as soon as the player plays the word. If it did, an unscrupulous player could just place "maybe" words and press submit to see whether they are actually valid.

Obj-C / iOS: Look through a document for any one of several thousand words?

As part of a document reader I'm writing for iPhone/iPad, I need the following functionality:
Search through a document of between approximately 500 and 10,000 words for words and phrases that appear in one of several lists. Each list contains between 100 and 5,000 words and phrases. When I find a word in the document that appears in one of those lists, I mark it and move on.
I will know the word lists ahead of time, but the documents will be unknown until the moment they need to be processed.
And this needs to be VERY FAST.
Any help would be greatly appreciated!
This presentation and paper present a fast multi-pattern string search algorithm. It also mentions some predecessors, should this one not fit your needs.
Multifast is an open source (LGPLed) C library that implements the Aho-Corasick algorithm.
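To give a feel for what Aho-Corasick does, here is a minimal Python sketch of the algorithm itself. It is not the multifast API; `build_automaton` and `find_all` are names invented for this example, and it matches raw substrings, so a document reader would still need to check word boundaries around each hit.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first pass: depth-1 states fail to the root, deeper states
    # fail to the longest proper suffix that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[child] = goto[fallback].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for index, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((index - len(pattern) + 1, pattern))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

The automaton is built once from the known word lists; each document is then scanned a single time, no matter how many patterns there are.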
I would create a huge hashmap with the phrases and words to search against at load time, since searching through hashmaps is very, very fast, especially at these sizes. Obviously a memory-hungry solution, but pretty trivial.
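A rough Python sketch of that hash-based approach, assuming whitespace-separated tokens and no punctuation handling. `mark_matches` is a made-up name; at each position it tries the longest candidate phrase first, marks a hit and moves on past it.

```python
def mark_matches(document, phrases):
    """Return (token_index, phrase) for each listed word or phrase found in document."""
    lookup = {phrase.lower() for phrase in phrases}  # built once, at load time
    longest = max((len(phrase.split()) for phrase in lookup), default=1)
    tokens = document.lower().split()  # assumption: punctuation already stripped

    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase starting at token i, then shorter ones.
        for size in range(min(longest, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in lookup:
                found.append((i, candidate))
                i += size  # mark it and move on
                break
        else:
            i += 1
    return found


text = "the quick brown fox jumps over the lazy dog"
print(mark_matches(text, ["brown fox", "lazy", "quick"]))
```

Each position costs at most `longest` hash lookups, so a 10,000-word document needs only a few tens of thousands of lookups even with phrases in the lists.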
iOS 4 and above seems to have functionality for custom dictionaries; perhaps you could exploit that somehow?

Arrays in Visual Basic

In declaring an array in VB, would you ever leave the zero element empty and adjust the code to make it more user friendly?
This is for Visual Basic 2008
No, I wouldn't do that. It seems like it might help maintainability, but that's a very short-sighted view.
Think about it this way. It only takes each programmer who has to understand and maintain the code a short amount of time to get comfortable with zero-indexed arrays. But if you're using one-based arrays, which are unlike those found in almost all other VB.NET code, and in fact almost every other common programming language, it will take everyone on the team much longer. They'll be constantly making mistakes, tripping up because their natural assumptions aren't accurate in this one special case.
I know how it feels. When I worked in VB 6, I loved one-based arrays. They were very natural for the type of data that I was storing, and I used them all over the place. Perfectly documentable here, because you have an explicit syntax to specify the upper and lower bounds of the array. That's not the case in VB.NET (which is a newer, but incompatible version of the Visual Basic language), where all arrays have to be zero-indexed. I had a hard time switching to VB.NET's zero-based arrays for the first couple of days. After that initial period of adjustment, I can honestly say I've never looked back.
Some might argue that leaving the first element of every array empty would consume extra memory needlessly. While that's obviously true, I think it's a secondary reason behind the one I presented above. Good developers write code for others to read, so I commend you for considering how to make your code logical and understandable. You're on the right path by asking this question. But in the long run, I don't think this decision is a good one.
There might be a handful of exceptions in very specific cases, depending on the type of data that you're storing in the array. But again, failing to do this across the board seems like it would hurt readability in the aggregate, rather than helping it. It's not particularly counter-intuitive to simply write the following, once you've learned how arrays are indexed:
For i As Integer = 0 To (myArray.Length - 1)
    'Do work
Next
And remember that in VB.NET, you can also use the For Each statement to iterate through your array elements, which many people find more readable. For example:
For Each i As Integer In myArray
    'Do work
Next
First, this is about being programmer-friendly, not user-friendly. The user will never know whether the code is 0-based or 1-based.
Second, 0-based is the default and will be used more and more.
Third, 0-based is more natural to the computer. At the most basic level, a bit has two states, 0 and 1, not 1 and 2.
I have upgraded a couple of VB6 projects to VB.NET. Converting to 0-based arrays at the start is better than debugging the code later.
Most of my VB.Net arrays are 0-based and every element is used. That's usual in VB.Net and code mustn't surprise the reader. Readability is vital.
Any exceptions? Maybe if I had a program ported from VB6, so it used 0-based arrays with unused initial elements, and it needed a small change, I might match the pattern of the existing code. Least surprise again.
99 times out of 100 the question shouldn't arise because you should be using List(Of T) rather than an array!
Who are the "users" that are going to see the array indexes? Any good developer will be able to handle a zero-indexed array and no real user should ever see them. If the user has to interact with the array, then make an actually user-friendly system for doing so (text or a 1-based virtual index or whatever is called for).
In Visual Basic 6 it is possible to declare an array starting from 1 if you find a 0-based array inconvenient (VB.NET, including Visual Basic 2008, only accepts a lower bound of 0):
Dim array(1 To 10) As Integer
It is just a matter of taste. I use 1-based arrays in Visual Basic but 0-based arrays in C ;)

What things should we consider while writing a spell checker?

I want to write a very simple spell checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those 'equivalent words'? What analysis can be performed on two words to mark them equivalent?
Before investing too much in trying to unravel that, I'd first look at existing implementations like Aspell or NetSpell, for two main reasons:
There is not much point in re-inventing the wheel. Spell checking is much trickier than it first appears, and it makes sense to build on work that has already been done.
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway.
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
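For the small-dictionary case mentioned above, the precomputation could be sketched like this in Python. The `edits1` helper is in the spirit of the one in Norvig's essay; `build_correction_table` and the three-word dictionary are assumptions made for the example.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def build_correction_table(dictionary):
    """Map every nearby misspelling (and each word itself) to its candidate words."""
    table = {}
    for word in dictionary:
        table.setdefault(word, set()).add(word)
        for variant in edits1(word):
            table.setdefault(variant, set()).add(word)
    return table


# Hypothetical tiny dictionary; the table grows quickly with dictionary size.
table = build_correction_table(["receive", "believe", "weird"])
print(sorted(table["recieve"]), sorted(table["wierd"]))
```

After this one-off preparation, correcting a word is a single hash lookup. The table holds several hundred entries per dictionary word, which is why the approach only suits small dictionaries.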
Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
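As an illustration, the classic Levenshtein form of edit distance (insertions, deletions and substitutions each cost 1) can be computed with a small dynamic-programming table. This is a plain Python sketch; real spell checkers often use variants that also count transpositions.

```python
def edit_distance(source, target):
    """Levenshtein distance between two strings, keeping one table row at a time."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete from source
                current[j - 1] + 1,                   # insert into source
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

The running time is proportional to the product of the two word lengths, so comparing an input word against every dictionary entry gets expensive for large dictionaries.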
I just finished implementing a spell checker and used a combination of the following to get a list of "suggested" words:
Phonetic hashing of the "misspelled" word to look up dictionary words with an identical phonetic hash (for Java, check out Apache Commons Codec for a suitable library). The phonetic hashes of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive, so you need to reduce the list first with something like a phonetic hash, assuming a higher volume load; in my case, a server-based spell check).
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language.
Essentially I weighted each potential word primarily based on edit-distance and commonality. e.g. if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also override any result with the known common misspellings (i.e. these always float to the top of the suggested results).
There may be better ways, but this worked pretty well.
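The weighting and the override can be sketched as below. This is my reading of the description in Python, not the original server code; `rank_suggestions`, the candidate words and their probabilities are all invented for the example, and the candidates are assumed to have been narrowed down already (e.g. by a phonetic hash).

```python
def edit_distance(source, target):
    """Plain Levenshtein distance (insert, delete and substitute each cost 1)."""
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def rank_suggestions(misspelled, candidates, known_fixes):
    """Order candidates by weight = edit-distance * 100 / probability (lower is better).

    candidates maps word -> probability as a percentage (must be > 0);
    known_fixes maps a common misspelling to its correction.
    """
    def weight(word):
        return edit_distance(misspelled, word) * 100 / candidates[word]

    ranked = sorted(candidates, key=weight)
    fix = known_fixes.get(misspelled)
    if fix is not None:
        # Known common misspellings always float to the top.
        ranked = [fix] + [word for word in ranked if word != fix]
    return ranked


# Made-up probabilities, purely to show the effect of the override.
candidates = {"receive": 2.0, "recipe": 4.0, "relieve": 0.5}
print(rank_suggestions("recieve", candidates, {}))
print(rank_suggestions("recieve", candidates, {"recieve": "receive"}))
```

With these numbers the weights are 100 for "receive", 50 for "recipe" and 200 for "relieve", so the formula alone would suggest "recipe" first; the known-misspelling list corrects that.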
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.
Under Linux/Unix you have ispell. Why reinvent the wheel?