I have a plist with 3200 dictionaries. Each dictionary has 20 key/values. What's the best way to search through it?
I have a string called "id", and what I am doing right now is iterating through all the elements of the array, asking each element (dictionary) for the value of the key "id", comparing that id with the other id I have, and breaking when it's found.
This is really slow, like I can see a lag of about 1-2 seconds. Is there a better way?
Thanks
What you're doing now is an O(n) operation (linear in the number of items in the list). You can get a "constant time" O(1) lookup if you keep another "lookaside" data structure that helps you index into your list.
Before you write out the 3200-item list of dictionaries, create one more special dictionary that maps your IDs to indexes in the big array. In other words, each key will be an ID and its value will be an NSNumber holding the index into the big array. Then save this as well (either in the same plist or a separate one).
Then when you need to do a lookup, just do -objectForKey: in the lookaside dictionary, which will immediately give you back the index of the entry you're looking for.
Just make sure your lookaside dictionary always stays in sync with the big array if you update them with live data. Note that this also assumes your IDs are unique (it sounds like they are).
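A rough Python sketch of that lookaside idea (the answer is about Cocoa's NSDictionary/NSNumber, but the shape is the same; the plist path here is a made-up placeholder):

import plistlib

# Assumes the plist's top-level object is the array of 3200 dictionaries.
with open("records.plist", "rb") as f:
    records = plistlib.load(f)

# Build the lookaside index once: id -> position in the big array.
lookaside = {rec["id"]: i for i, rec in enumerate(records)}

def find_record(record_id):
    # O(1) lookup instead of scanning all 3200 entries.
    i = lookaside.get(record_id)
    return records[i] if i is not None else None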
Why don't you use a SQLite database?
The first thing I notice is that it seems you're always searching on the same id key. If that's the case, then you should sort your array of dictionaries according to id. You can then do a binary search on the sorted array. Result: finding any dictionary by id takes a maximum of 12 operations. By contrast, a linear search through 3200 items averages 1600 operations and might need as many as 3200.
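A rough Python sketch of that sorted-array approach (where records stands for the same list of 3200 dictionaries; the original question is about NSArray, but the algorithm is identical):

import bisect

# Sort once by id, then binary-search on every lookup (~12 comparisons max for 3200 items).
records.sort(key=lambda rec: rec["id"])
ids = [rec["id"] for rec in records]  # parallel, sorted list of ids

def find_record(record_id):
    i = bisect.bisect_left(ids, record_id)
    if i < len(ids) and ids[i] == record_id:
        return records[i]
    return None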
Core Data might be a very good solution here if you need to search on several different keys, and if all those dictionaries have the same keys. NSManagedObject works a lot like NSMutableDictionary, but the framework will take care of indexing for you, and searching is fast and relatively easy.
I'm grabbing and archiving A LOT of data from the Federal Election Commission public data source API, which has a unique record identifier called "sub_id" that is a 19-digit integer.
I'd like a memory-efficient way to catalog which line items I've already archived, and Redis bitmaps immediately come to mind.
Reading the documentation, Redis bitmaps have a maximum length of 2^32 bits (4294967296).
A 19-digit integer could theoretically range anywhere from 0000000000000000001 - 9999999999999999999. Now I know that the data source in question does not actually have anywhere near 10 quintillion records, so the ids are clearly sparsely populated and not sequential. Of the data I currently have on file, the maximum ID is 4123120171499720404 and the minimum is 1010320180036112531. (I can tell the ids are date-based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't suss out the rest of the pattern.)
If I wanted to store which line items I've already downloaded, would I need 2328306436 different Redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54.) I could probably work up a tiny algorithm that, given a 19-digit id, divides by some constant to determine which split bitmap to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom filter, such as the one provided by RedisBloom, would be an excellent fit here (RedisBloom can even grow the filter if you miscalculated your desired capacity).
After you create your filter with BF.RESERVE, you pass each 'item' to be inserted to BF.ADD. The item can be as long as you want; the filter uses hash functions and a modulus to fit it to the filter size. When you want to check whether an item was already added, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example of a case where a Bloom filter is a great fit.
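A minimal sketch of that flow with redis-py's generic command interface (assumes a Redis server with the RedisBloom module loaded; the key name and sizing numbers are made up):

import redis

r = redis.Redis()

# Reserve a filter sized for roughly 100 million items with a 0.1% false-positive rate.
r.execute_command("BF.RESERVE", "fec:sub_ids", 0.001, 100_000_000)

sub_id = "4123120171499720404"
r.execute_command("BF.ADD", "fec:sub_ids", sub_id)            # 1 = newly added
seen = r.execute_command("BF.EXISTS", "fec:sub_ids", sub_id)  # 1 = probably seen, 0 = definitely not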
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires at least 1250 petabytes (10^19 bits / 8 = 1.25 * 10^18 bytes ≈ 1250 PB). This makes it impractical (at the moment) to store in memory.
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the ids are not sequential and are very spread out, keeping track of which ones you've processed with a bitmap is not the best option, since it would waste a lot of memory.
However, it is hard to point to the best solution without knowing how many distinct sub_ids your data set has. If you are talking about a few tens of millions, a simple Redis set may be enough.
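A quick redis-py sketch of that plain-set alternative (the key name is an assumption):

import redis

r = redis.Redis()

# Mark a sub_id as processed, then test membership before re-downloading.
r.sadd("fec:sub_ids:seen", "4123120171499720404")
already_archived = r.sismember("fec:sub_ids:seen", "4123120171499720404")  # True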
Going over some interview info about data structures etc.
So, as I understand it, arrays are O(1) for indexing, which I believe means finding the specific element contained at index x in the array. I just want to confirm this, as I am second-guessing myself.
Also, hash maps are O(1) for indexing, searching, insertion and deletion. Does that not kind of make any data structure question pointless, since a hash map will always be the best solution?
Thanks
Well, indexing is not only about arrays.
According to this, indexing is creating tables (indexes) that point to the location of folders, files and records. Depending on the purpose, indexing identifies the location of resources based on file names, key data fields in a database record, text within a file or unique attributes in a graphics or video file.
For your second question: hash maps are not an absolute or always-best data structure, for various reasons, mainly:
Collisions
Hash function calculation time
Extra memory used
Also, there are lots of data structure questions where hash maps are not superior:
Data structure for finding the k-th minimum element while supporting updates (a hash map would be essentially brute force because it does not keep elements sorted, so we need something like a balanced binary search tree; see the small sketch after this list)
Data structure for finding whether a word is in a dictionary (sure, a hash map works, but a trie can be faster and use less memory here)
Data structure for finding the minimum element in any range of an array, with updates (once again a hash map is just too slow for this; we need something like a segment tree)
...
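A tiny Python illustration of the first point: a hash-based map/set gives O(1) membership but keeps no ordering, so an order query like "k-th smallest" needs a different structure (a heap here for a one-off query; with frequent updates you would want the balanced BST mentioned above):

import heapq

values = {42, 7, 19, 3, 88, 25}  # hash-based set: O(1) membership, no ordering

print(19 in values)                     # True, via an O(1) hash lookup
print(heapq.nsmallest(3, values)[-1])   # 3rd smallest -> 19, but this costs O(n log k)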
When creating a key in Redis, I understand the convention of using the ":" format and treating the key similar to a URL structure.
But what if that structure itself contains key/value-type combinations? Does one put the key names in the structure?
Made-up Example:
Option A) "country:usa:manufacturer:ford:vehicle:f150:color" = black
or
Option B) "usa:ford:f150:color" = black
In some ways, I think that there is strength in the structure of Option A, but it also adds a lot of complexity to the key.
Thoughts?
While keeping your made-up example in mind (do try to use an actual example, you'll get better answers), I would have to say neither.
I would go with an ID for the key, likely an int. Then I'd put each key/value pair from your Option A in as a hash field and value.
For example:
HSET 1 country USA
HSET 1 manufacturer ford
And so on. Or you could use an HMSET operation to set them all at once (on newer Redis versions, HSET itself accepts multiple field/value pairs).
Why? You get the benefit of keeping the field names that describe the data (which you lose in your Option B), the memory advantages of hashes over strings, and reduced complexity in the key structure, not to mention the memory benefit of a short integer as the key name versus a long string.
Further, you have a memory-cheap way to create indexes as integer sets. For example, a key called "country:1" could be a set of entry IDs, which gives you a way to "pull all entries for country ID 1" (USA in the example). By using integers you get the benefit of being able to store these all in a very memory-efficient way, at the minor cost of a lookup table. This could even be done in Lua to avoid a network hop.
The greater the range of possible combinations and entries, the more valuable the memory savings are. If you've got millions or billions of them, you'll want to follow the integer-ID & lookup route. This would also set you up nicely if you ever need to shard data - either server side or client side.
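A short redis-py sketch of that layout (the integer ID, the field values and the "country:1" index key follow the example above; everything else is assumed):

import redis

r = redis.Redis(decode_responses=True)

entry_id = 1
# One hash per entry, keyed by the short integer ID.
r.hset(str(entry_id), mapping={
    "country": "USA",
    "manufacturer": "ford",
    "vehicle": "f150",
    "color": "black",
})

# Index set: every entry ID belonging to country ID 1 (1 = USA in a small lookup table).
r.sadd("country:1", entry_id)

# "Pull all entries for country ID 1":
entries = [r.hgetall(eid) for eid in r.smembers("country:1")]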
I have a very large number of strings in 200 txt files which I'm trying to filter down to only the unique ones. I was thinking of using NSSet for this, but the problem is that there are 300 million strings in the initial files and I can't load them all into an NSSet because it takes a very long time to initialize.
Can anybody suggest a better approach or a workaround that could help me solve this problem?
Here is a solution that is low-cost in both memory and CPU consumption:
You can use an SQLite database: create a table with one string column declared as a unique key, which will receive each string you parse.
During insertion of each string, if the string is already in the table it won't be inserted, and at the end the table will contain only unique strings.
Write your code so that it keeps inserting subsequent strings when an insertion fails because of an already-existing string (duplicate key).
Edit: also add an index on this column, because your use case involves a lot of entries.
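A sketch of that approach with Python's sqlite3 (the database path, table name and input file names are assumptions):

import sqlite3

conn = sqlite3.connect("unique_strings.db")
# A PRIMARY KEY (unique, indexed) column rejects duplicate strings for us.
conn.execute("CREATE TABLE IF NOT EXISTS strings (value TEXT PRIMARY KEY)")

paths = [f"input_{i:03d}.txt" for i in range(200)]  # hypothetical file names
for path in paths:
    with open(path, encoding="utf-8") as f:
        # INSERT OR IGNORE skips rows that hit the unique constraint,
        # so parsing keeps going past duplicates as recommended above.
        conn.executemany(
            "INSERT OR IGNORE INTO strings (value) VALUES (?)",
            ((line.rstrip("\n"),) for line in f),
        )
    conn.commit()
conn.close()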
Maybe you could just keep the unique ones in memory. As you parse the files, you can compare each string you read with the ones already in the unique array, and if there is no match, add it to the array. But maybe this is not a very good solution, because if you have a lot of unique strings this leads to many comparisons and could also take a long time.
But give it a try, measure the execution time and see if this works for your case.
Redis.io:
The main features of Redis Lists from the point of view of time complexity is the support for constant time insertion and deletion of elements near the head and tail, even with many millions of inserted items. Accessing elements is very fast near the extremes of the list but is slow if you try accessing the middle of a very big list, as it is an O(N) operation.
What is the alternative to a LIST when the data volume is very high and writes are less frequent than reads?
This is something I'd definitely benchmark before doing, but if you're really hitting a performance issue accessing items in the middle of the list, there are a couple of alternatives that really depend on your use case.
Don't make the list so big in the first place; age out / trim pieces that don't matter any more.
Memoize hot sections of the list. If a particular paginated range is being requested much more often than others, make that its own list. Check whether it exists already, and if it doesn't, create a subset of your list covering that paginated range.
Bucket your list from the beginning into "manageable sizes" (for whatever your definition of manageable is). If a list is purely additive (no removal from the list), you could use the modulus of an item's index as part of the key so that your list is stored in smaller buckets, e.g. key = "your_key_name_" + index % 100000 (see the sketch below).
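A minimal redis-py sketch of that last bucketing idea (the key prefix and bucket count come from the example above; the helper names are made up):

import redis

r = redis.Redis()
BUCKETS = 100000

def bucketed_push(index, value):
    # The item's global index picks which smaller list (bucket) it lands in.
    r.rpush(f"your_key_name_{index % BUCKETS}", value)

def bucketed_get(index):
    # With a purely additive list pushed in index order, item number `index`
    # sits at position index // BUCKETS inside its bucket.
    return r.lindex(f"your_key_name_{index % BUCKETS}", index // BUCKETS)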