Fastest key/value pair container in Objective-C - objective-c

I am creating a syntax highlighting engine. My need is very specific. Keywords will be associated to their respective attribute array via a pointer. The data structure will look something like:
dict = {
"printf": keyword_attr_ptr
, "sprintf": keyword_attr_ptr
, "#import": special_attr_ptr
, "string": lib_attr_ptr
}
The look-up needs to be very fast as I will be iterating over this list every keypress.
I'm asking this question because I can not find any good documentation regarding how NSDictionary caches (if it does) and looks up values by its keys (does it use a map? a hashmap?). Can I rely on NSDictionary to be optimized to search for keys by strings?
When I was doing something similar a long while ago I used the MFC CMap function with very good results. NSDictionary appears to be the equivalent to CMap but the key type isn't specified and the NSDictionary clearly states that a key can be any type of object. I just want to make sure I can rely on it to return the results extremely fast before I put a lot of energy into this problem.
UPDATE 1
After a day of research, I ask the question on SO and I find the answer immediately after... go figure.
This is the documentation related to Dictionaries:
https://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Collections/Articles/Dictionaries.html
It uses a hash table to manage its storage. I guess the short answer is that its almost equivalent to CMap.

Related

SHOW KEYS in Aerospike?

I'm new to Aerospike and am probably missing something fundamental, but I'm trying to see an enumeration of the Keys in a Set (I'm purposefully avoiding the word "list" because it's a datatype).
For example,
To see all the Namespaces, the docs say to use SHOW NAMESPACES
To see all the Sets, we can use SHOW SETS
If I want to see all the unique Keys in a Set ... what command can I use?
It seems like one can use client.scan() ... but that seems like a super heavy way to get just the key (since it fetches all the bin data as well).
Any recommendations are appreciated! As of right now, I'm thinking of inserting (deleting) into (from) a meta-record.
Thank you #pgupta for pointing me in the right direction.
This actually has two parts:
In order to retrieve original keys from the server, one must -- during put() calls -- set policy to save the key value server-side (otherwise, it seems only a digest/hash is stored?).
Here's an example in Python:
aerospike_client.put(key, {'bin': 'value'}, policy={'key': aerospike.POLICY_KEY_SEND})
Then (modified Aerospike's own documentation), you perform a scan and set the policy to not return the bin data. From this, you can extract the keys:
Example:
keys = []
scan = client.scan('namespace', 'set')
scan_opts = { 'concurrent': True, 'nobins': True, 'priority': aerospike.SCAN_PRIORITY_MEDIUM }
for x in (scan.results(policy=scan_opts)): keys.append(x[0][2])
The need to iterate over the result still seems a little clunky to me; I still think that using a 'master-key' Record to store a list of all the other keys will be more performant, in my case -- in this way, I can simply make one get() call to the Aerospike server to retrieve the list.
You can choose not bring the data back by setting includeBinData in ScanPolicy to false.

check if 2 linked list have the same elements regardless of order

Is there any way to check if 2 linked lists have the same elements regardless of order.
edit question:
I have fixed the code and given some more details:
this is the method that compares 2 lists
compare: object2
^ ((mylist asBag) = ((objetc2 getList) asBag)).
the method belongs to the class myClass that has a field : myLList. myList is a linkedList of type element.
I have compiled it in the workspace:
a: = element new id:1.
b:= element new id:2.
c:=element new id:3.
d: = element new id:1.
e:= element new id:2.
f:=element new id:3.
elements1 := myClass new.
elements addFirst:a.
elements addFirst:b.
elements addFirst:c.
elements2 := myClass new.
elements addFirst:d.
elements addFirst:e.
elements addFirst:f.
Transcript show: (elements1 compare:elements2).
so I am getting false.. seems like it checks for equality by reference rather than equality by value..
So I think the correct question to ask would be: how can I compare 2 Bags by value? I have tried the '=='..but it also returned false.
EDIT:
The question changed too much - I think it deserves a new question for itself.
The whole problem here is that (element new id: 1) = (element new id: 1) is giving you false. Unless it's particular class (or superclasses) redefine it, the = message is resolved comparing by identity (==) by default. That's why your code only works with a collection being compared with itself.
Test it with, for example, lists of numbers (which have the = method redefined to reflect what humans understand by numeric equality), and it will work.
You should redefine your element's class' = (and hashCode) methods for this to work.
Smalltalk handles everything by reference: all there exist is an object, which know (reference) other objects.
It would be wrong to say that two lists are equivalent if they are in different order, as the order is part of what a list means. A list without an order is what we call a bag.
The asBag message (as all of the other as<anotherCollectionType> messages) return a new collection of the named type with all the elements of the receiver. So, #(1 2 3 2) is an Array of four elements, and #(1 2 3 2) asBag is a bag containing those four elements. As it's a Bag, it doesn't have any particular order.
When you do bagA := Bag new. you are creating a new Bag instance, and reference it with bagA variable. But then you do bagA := myList asBag, so you lose the reference to the previous bag - the first assignment doesn't do anything useful in your code, as you don't use that bag.
Saying aBool ifTrue: [^true] ifFalse: [^false] has exactly the same meaning as saying ^aBool - so we prefer just to say that. And, as you only create those two new bags to compare them, you could simplify your whole method like this:
compareTo: anotherList
^ myList asBag = anotherList asBag
Read it out loud: this object (whatever it is) compares to another list if it's list without considering order is the same than the other list without order.
The name compareTo: is kind of weird for returning a boolean (containsSameElements: would be more descriptive), but you get the point much faster with this code.
Just to be precise about your questions:
1) It doesn't work because you're comparing bag1 and bag2, but just defined bagA and bagB.
2) It's not efficient to create those two extra bags just because, and to send the senseless ifTrue: message, but other way it's OK. You may implement a better way to compare the lists, but it's way better to rely on the implementation of asBag and the Bag's = message being performant.
3) I think you could see the asBag source code, but, yes, you can assume it to be something like:
Collection>>asBag
|instance|
instance := Bag new.
instance addAll: self.
^instance
And, of course, the addAll: method could be:
Collection>>addAll: anotherCollection
anotherCollection do: [ :element | self add: element ]
So, yes - it creates a new Bag with all the receiver's elements.
mgarciaisaia's answer was good... maybe too good! This may sound harsh, but I want you to succeed if you're serious about learning, so I reiterate my suggestion from another question that you pick up a good Smalltalk fundamentals textbook immediately. Depending on indulgent do-gooders to rework your nonsensical snippets into workable code is a very inefficient way to learn to program ;)
EDIT: The question has changed dramatically. The following spoke to the original three-part question, so I paraphrased the original questions inline.
Q: What is the problem? A: The problem is lack of fundamental Smalltalk understanding.
Q: Is converting to bags an efficient way to make the comparison? A: Although it's probably not efficient, don't worry about that now. In general, and especially at the beginning when you don't have a good intuition about it, avoid premature optimization - "make it work", and then only "make it fast" if justified by real-world profiling.
Q: How does #asBag work? A: The implementation of #asBag is available in the same living world as your own code. The best way to learn is to view the implementation directly (perhaps by "browsing implementors" if you aren't sure where it's defined") and answer your own question!! If you can't understand that implementation, see #1.

How to properly convert to a canonical string for searching in Cocoa?

I have a string field that I know that users will want to search on later. Inspired by the WWDC 2012 Core Data Best Practices session I plan to store a normalized version of the string into a separate field so I can optimize my search predicates.
My primary concern is case insensitivity, but while I'm normalizing strings I figure that I should also normalize the unicode representation. But I want to be sure I use the right normalization form (i.e. C,D,KC or KD). And does it matter whether I convert to lowercase first? (Localization is not my strong suit.)
So:
What are the proper methods to call to do the search normalization of the NSString?
What would be the optimal way to make sure the normalized version is stored.
I will post my first attempt as an answer, but I'd love to hear where I am wrong, other suggestions, or improvements. (Unfortunately while they showed the search predicates in that video, I don't think they showed the code from the session.)
For the use case you describe, it doesn't matter whether you pick precomposed or decomposed (C or D; although you will save a bit of space with precomposed), but think carefully about whether you want canonical or compatibility (K forms). TR15 has a nice figure that summarises the differences (Figure 6):
That is: if someone searches for "ſ" (a 'long s') do you want to match "s" (and vice versa)? These are regarded as "formatting distinctions", so you shouldn't replace the text the user enters with these forms (as you lose data), but you may want to ignore them when searching.
With regard to a case-insensitive comparison, it's not enough to simply make both strings lowercase and compare them. It will work for English, but there are languages where the mapping between lower and uppercase (if such a distinction even exists) is no so clear. The W3C wiki has a nice summary of these "case folding" issues. Unfortunately, you can't optimise this in your storage by keeping the data in one "case", you can only do a proper comparison when you know both strings and the locale.
Luckily, when working with an NSString it's -compare:options:range:locale: lets you specify an NSCaseInsensitiveSearch option and the locale (if you know it), which will handle these case folding problems for you (also take a look at NSDiacriticInsensitiveSearch and NSWidthInsensitiveSearch to see if you want to be agnostic about those differences too).
What I currently plan to do is override the setter for the field, like so:
- (void)setName:(NSString *)value
{
[self willChangeValueForKey:#"name"];
[self setPrimitiveValue:value forKey:#"name"];
[self didChangeValueForKey:#"name"];
//Store normalized for for searching
[self willChangeValueForKey:#"searchName"];
[self setPrimitiveValue:[[value lowercaseStringWithLocale:[NSLocale currentLocale]] decomposedStringWithCompatibilityMapping] forKey:#"searchName"];
[self didChangeValueForKey:#"searchName"];
}
I also made the searchName property read-only.

Find a suitable vocabulary database to build a C structure

Let's begin with the question final purpose: my aim is to build a word-based neural network which should take a basic sentence and select for each individual word the meaning it is supposed to yield in the sentence itself. It is then going to learn something about the language (for example the possible correlation between two given words, what is the probability to find both in a single sentence and so on) and at the final stage (after the learning phase) try to build some very simple sentences of its own according to some input.
In order to do this I need some kind of database representing a vocabulary of a given language from which I could extract some information such as word list, definitions, synonyms et cetera. The database should be structured in a way such that I can build C data structures containing the needed information such as
typedef struct _dictEntry DictionaryEntry;
typedef struct _dict Dictionary;
struct _dictEntry {
const char *word; // Word string
const char **definitions; // Array of definition strings
DictionaryEntry **synonyms; // Array of pointers to synonym words
Dictionary *dictionary; // Pointer to parent dictionary
};
struct _dict {
const char *language; // Language identification string
int count; // Number of elements in the dictionary
float **correlations; // Correlation matrix between i-th and j-th entries
DictionaryEntry *entries; // Array of dictionary entries
};
or equivalent Obj-C objects.
I know (from Searching the Mac OSX system dictionaries?) that apple provided dictionaries are licensed so I cannot use them to create my data structures.
Basically what I want to do is the following: given an arbitrary word A I want to fetch all the dictionary entries which have a definition containing A and select such definition only. I will then implement some kind of intersection procedure to select the most appropriate definition and synonyms based on the rest of the sentence and build a correlation matrix.
Let me give a little example: let us suppose I type a sentence containing "play"; I want to fetch all the entries (such as "game", "instrument", "actor", etc.) the word "play" can be correlated to and for each of them select the corresponding definition (I don't want for example to extract the "instrument" definition which corresponds to the "tool" meaning since you cannot "play a tool"). I will then select the most appropriate of these definitions looking at the rest of the sentence: if it contains also the word "actor" then I will assign to "play" the meaning "drama" or another suitable definition.
The most basic way to do this is scanning every definition in the dictionary searching for the word "play" so I will need to access all definitions without restrictions and as I understand this cannot be done using the dictionaries located under /Library/Dictionaries. Sadly this work MUST be done offline.
Is there any available resource I can download which allows me to get my hands on all the definitions and fetch my info? Currently I'm not interested in any particular file format (could be a database or an xml or anything else) but it must be something I can decompose and put in a data structure. I tried to google it but, whatever the keywords I use, if I include the word "vocabulary" or "dictionary" I (pretty obviously) only get pages about the other words definitions on some online dictionary site! I guess this is not the best thing to search for...
I hope the question is clear... If it is not I'll try to explain it in a different way! Anyway, thanks in advance to all of you for any helpful information.
Probably an ontology which is free, like http://www.eat.rl.ac.uk would help you. In the university sector there are severals available.

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).