How to store a huge number of NSStrings for comparison purposes

I am writing a (linguistic) morphology Mac application. I often have to check whether the words in a given text are in a huge list of words (~1,000,000).
My question is: how do I store these lists? I see two options:
I use a .txt file to store the words and create an NSSet from this file, which lives as long as the application is running.
I use a database like SQLite.
Some points:
I think the focus should be on speed, because the analysis is triggered by the user and these comparisons make up the largest part of the computation.
The lists may change via updates.
I have used Core Data and MySQL before, so (I think) I could implement either.
I have read a lot about the pros and cons of database vs. file, but I never felt it covered my use case.
I don't know whether it matters which technique I use, because the size of these files is relatively small (~20 MB) and, even with a lot of supported languages, only 3-4 of these files will be loaded into memory at the same time.
Thanks! Danke!

Related

Is there an OCaml library to store/use data structures on disk

like bdb. However, I looked at the ocaml-bdb, seems like it's made to store only string. My problem is I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put them on database or those key-value db things, which is my last resort. I'm wondering if there's a better way.

The HDF4 / HDF5 file format might suit your needs. See http://forge.ocamlcore.org/projects/ocaml-hdf/

In addition to the HDF4 bindings mentioned by jrouquie there are HDF5 bindings available (http://opam.ocaml.org/packages/hdf5/). Depending on the type of data you're storing there are bindings to GDAL (http://opam.ocaml.org/packages/gdal/).
For data which can fit in a bigarray you also have the option of memory mapping a large file on disk. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.Genarray.html#VALmap_file for example. While it ties you to a rather strict on-disk format, it does make it relatively simple to manipulate arrays which are larger than the available RAM.
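As a rough illustration of the memory-mapping idea (shown in Python rather than OCaml for brevity; the file name, sizes and helper names are made up for this example), the standard mmap module lets a file of fixed-width floats be indexed like an array without reading all of it into memory:

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<d")  # one little-endian 64-bit float per slot


def create_array_file(path, n_items):
    """Create a file big enough to hold n_items floats."""
    with open(path, "wb") as f:
        f.truncate(n_items * ITEM.size)


def write_item(path, index, value):
    """Write one float in place through a memory map."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
        ITEM.pack_into(m, index * ITEM.size, value)


def read_item(path, index):
    """Read one float through a read-only memory map of the file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return ITEM.unpack_from(m, index * ITEM.size)[0]


def demo():
    # Hypothetical scratch file holding one million floats (8 MB on disk).
    path = os.path.join(tempfile.mkdtemp(), "data.bin")
    create_array_file(path, 1_000_000)
    write_item(path, 999_999, 3.5)
    return read_item(path, 999_999), os.path.getsize(path)
```

The fixed slot size is what ties you to a strict on-disk format: the offset of element i is simply i times the item size.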

There was an OCaml BerkeleyDB wrapper in the past:
OCamlDB
Apparently someone looked into it recently:
recent patch for OCamlDB
However, the GDAL bindings from hcarty are probably production ready and in intensive usage somewhere.
Also, there are bindings for dbm in opam: dbm and cryptodbm

HDF5 is probably the answer, but given that the question is somewhat vague, another solution is possible.
Disclaimer: I don't know OCaml (though I knew Caml Light), but I do know Berkeley DB (a.k.a. bsddb, a.k.a. bdb).
You wrote: "However, I looked at the ocaml-bdb, seems like it's made to store only string."
That may be true of ocaml-bdb, but bdb itself stores bytes. I am not sure about your case; in Python 2, for example, the plain string type was a byte string, so the distinction was easy to miss, and it is only with Python 3 that there is a separate bytes type, which the bdb bindings take and return. The difference is subtle, but you'd rather work with bytes because that is what bdb understands and uses.
"My problem is I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put them on database or use those key-value db things, which is my last resort. I'm wondering if there's a better way."
It depends on your needs and on what the data looks like.
If the data can all stay in memory, you'd rather dump memory to a file and load it back.
If you need to share that data among several architectures or operating systems, you'd rather use a serialisation framework like HDF5. Remember that HDF5 doesn't handle circular references.
If the data cannot all stay in memory, then you need to use something like bdb (or WiredTiger).
Why bdb (or WiredTiger)?
Simply said, several decades of work have gone into:
splitting data
storing it on disk
retrieving data
as fast as possible.
WiredTiger, written by former bdb developers, is often described as its successor.
So yes, you could split the files yourself and so on, but that would require a lot of work. Only specialized companies do that (Bloomberg included); among those that manage all of the above themselves are the well-known PostgreSQL and MariaDB projects, Google, and Algolia.
Ordered key-value stores like WiredTiger and bdb use algorithms similar to those of higher-level databases like PostgreSQL and MySQL, or specialized ones like Lucene/Solr or Sphinx, i.e. MVCC, B-trees, LSM trees, PSSI, etc.
Since version 3.2, MongoDB has used WiredTiger as its default storage engine.
Some people argue that key-value stores are not good at storing relational data; that said, several projects have built distributed databases on top of key-value stores, e.g. FoundationDB or CockroachDB. This is a clue that the approach is useful.
The idea behind key-value stores is to deliver a generic framework for:
splitting data
storing it on disk
retrieving data
as fast as possible, while giving some guarantees (like ACID) and other nice-to-haves (like compression or cryptography).
To take advantage of the power offered by those libraries, you need to learn about key-value composition.
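To make "key-value composition" a little more concrete, here is a minimal sketch in Python. It is not the bdb or WiredTiger API: OrderedStore is a toy in-memory stand-in for an ordered key-value store, and cell_key and row_prefix are hypothetical helpers. The point is only that several fields are packed into one byte key so that the store's sort order keeps related records adjacent, which turns a prefix scan into a query.

```python
import bisect
import struct


class OrderedStore:
    """Toy in-memory stand-in for an ordered key-value store."""

    def __init__(self):
        self._keys = []    # byte keys, kept sorted
        self._values = {}  # key -> value bytes

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._values[self._keys[i]]
            i += 1


def cell_key(array_name, row, col):
    """Compose (array name, row, col) into a single key.

    Big-endian unsigned integers sort bytewise in numeric order, so all cells
    of one row end up next to each other. Assumes the name has no NUL byte.
    """
    return array_name.encode("utf-8") + b"\x00" + struct.pack(">II", row, col)


def row_prefix(array_name, row):
    """Prefix shared by every cell key of one row."""
    return array_name.encode("utf-8") + b"\x00" + struct.pack(">I", row)


def demo():
    store = OrderedStore()
    for row in range(3):
        for col in range(4):
            store.put(cell_key("matrix", row, col), struct.pack(">d", row * 10.0 + col))
    # Fetch row 1 only, without touching the other rows.
    return [struct.unpack(">d", v)[0] for _, v in store.scan_prefix(row_prefix("matrix", 1))]
```

A giant array stored this way can be read one row (or one block) at a time, which is the kind of access pattern a real ordered store would serve from disk.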

What are the downsides of plain text files as configuration files with only a few values in iOS?

Why use plists and xml files? If I only want to store a few values, is it okay to use plain text files or does this go against Objective-C best practices?
-------EDIT-------
I'm not sure if this should go in a separate post or not, so I'll just put it here...
If I'm making an app where a user can design a cupcake and save it with their preferences (color, flavor, size), which method should I use? I imagine my users aren't going to make hundreds of designs, but some will inevitably make a large number.

I'd say it's fine, but considering how easy they make it to read plist files, my question would be why bother?

The main reason that plists are commonly used is that the native APIs can handle them easily. You can load an NSArray/NSDictionary directly from a plist with one command.
SQL databases are used when you are going to have many occurrences of similar data. For example, if you need to record contacts for a social app, you would use a database that could contain the id, name, age, gender, phone number, email, etc.
Other than these, there are custom binary formats, but these are specialized for whatever project is being worked on. Depending on what you need to accomplish, a text file could work for you, but there may be better answers. There is nothing wrong with using text files, but these are not commonly used as you would have to write your own methods to parse them. Really, I would need to know more about what type of data you will be storing before I could tell you which option would be best.
An edit to answer your edit:
For this, the best thing to do would be to use a SQL database, as every cupcake is going to have various properties, such as name, type of cake mix used, type of frosting, color of frosting, sprinkles yes/no, etc. This is perfect for a SQL database, because databases have named columns (i.e. "name", "mixType", etc.), and each row in the table will have different values for each of these columns.
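A minimal sketch of such a table, written in Python with the built-in sqlite3 module for brevity (on iOS the same SQL would go through the SQLite C API or a wrapper). The table layout and function names are assumptions made for this example, not something from the question:

```python
import sqlite3


def open_cupcake_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS cupcakes (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               flavor TEXT,
               color TEXT,
               size TEXT,
               sprinkles INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def save_cupcake(conn, name, flavor, color, size, sprinkles=False):
    """Insert one design and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO cupcakes (name, flavor, color, size, sprinkles) VALUES (?, ?, ?, ?, ?)",
            (name, flavor, color, size, int(sprinkles)),
        )
    return cur.lastrowid


def cupcakes_by_flavor(conn, flavor):
    """Return (name, color, size) for every design with the given flavor."""
    rows = conn.execute(
        "SELECT name, color, size FROM cupcakes WHERE flavor = ? ORDER BY name", (flavor,)
    )
    return rows.fetchall()
```

Each saved design is one row, so a user with hundreds of designs costs nothing extra in code, and filtering by any column is a one-line query.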
This could also be implemented with a plist, but it wouldn't be as efficient, and it would use more disk space (although not much if you're just using it for cupcakes). I assume that you have a Cupcake class, so you could just implement a load function like this:
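+ (instancetype)cupcakeWithContentsOfFile:(NSString *)file {
    // Returns nil if the file is missing or is not a dictionary plist.
    NSDictionary *plist = [NSDictionary dictionaryWithContentsOfFile:file];
    if (plist == nil) return nil;
    // A class method has no instance yet, so create one here.
    Cupcake *cupcake = [[self alloc] init];
    cupcake.flavor = plist[@"flavor"];
    // Etc.
    return cupcake;
}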

This is an interesting question. I actually have an app out there that DOES indeed use simple text files for the data the app uses. The original code came from a Windows/Mac program, and I wanted to keep the data files consistent; Windows doesn't provide plist-type operations. I wrote all of the file-reading code on Windows, and since it was all ANSI C it ported to Mac OS X quite nicely.
When it came to porting it over to the iPhone and iPad, it was again simpler to port the code than to rewrite not only the data files into plist format, but also the plist-reading code.
For all apps that I have begun from scratch and that aren't also bound to Windows, I have used plists, though.
This is actually an issue that I go back and forth on. Plist and/or XML reading and processing is certainly NOT going to perform as well as a well-designed text file format, as both are bloated by all of the extra tags that may or may not be necessary. If you are trying to develop a single data file that holds many types of data, then a plist could be the simpler approach, but if you are looking for smaller data size, and maybe faster execution, then designing your own file format may be much better. I know this isn't a definitive answer.
So in the end, it isn't a "best practices" sort of thing, but really a matter of preference.

Provide example for why it is not advisable to store images in CoreData?

This question has been asked many times, and I have read many users saying that it is not advisable to store images in a DB, in particular within Core Data. But they all seem to omit the reason why. Even Apple's documentation states this, everybody points in that direction, and every discussion ends like this: "well, you can, but storing the path is better".
Apart from opinions, I would like to have a concrete example of why it is not a good solution.
To explain better: I have a strong background in building web applications. A concrete example from my point of view would be: do not store images in a DB, but rather the path to them, because then you can have them served by the web server, which can apply all of its caching mechanisms.
But in a desktop environment, and especially in an iOS application, what are the downsides of storing images in Core Data using SQLite, provided that:
There's a separate entity holding the images; it is not an attribute of the main entity
There also seems to be a limit of 100 KB for images. Why? What happens with 110, 120 ... 200 KB, etc.?
thanks

There's nothing special about what Core Data normally does here. It's just using an SQLite database. You can put large blobs of data into it, but it just doesn't scale all that well. You can read more about it here: Internal Versus External BLOBs in SQLite.
That said, Core Data has support for external blobs, which in Core Data terminology is called "stored in external record" (iOS 5.0 and later). Again, there's nothing magic about it; it's just storing the large pieces of data in the file system, separately from the SQLite db itself. The benefit is that Core Data keeps all of this up to date for you.
When you're in Xcode, there'll be a checkbox called Allows External Storage that you can check for Binary Data properties.
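The external-storage idea can be sketched by hand as follows. This is not Core Data's implementation, just an illustration in Python with sqlite3: the image bytes go into their own file and the database row keeps only the path. The table layout, directory names and function names are assumptions made for the example.

```python
import hashlib
import os
import sqlite3
import tempfile


def open_store(db_path, blob_dir):
    """Open the metadata database and make sure the blob directory exists."""
    os.makedirs(blob_dir, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, title TEXT, path TEXT NOT NULL)"
    )
    return conn


def add_image(conn, blob_dir, title, data):
    """Write the bytes to their own file and record only the path in the database."""
    path = os.path.join(blob_dir, hashlib.sha256(data).hexdigest() + ".bin")
    with open(path, "wb") as f:
        f.write(data)
    with conn:
        cur = conn.execute("INSERT INTO images (title, path) VALUES (?, ?)", (title, path))
    return cur.lastrowid


def load_image(conn, image_id):
    """Look up the path in the database, then read the bytes from the file system."""
    row = conn.execute("SELECT path FROM images WHERE id = ?", (image_id,)).fetchone()
    if row is None:
        return None
    with open(row[0], "rb") as f:
        return f.read()


def demo():
    root = tempfile.mkdtemp()
    blob_dir = os.path.join(root, "blobs")
    conn = open_store(os.path.join(root, "meta.sqlite"), blob_dir)
    payload = b"\x89PNG" + bytes(200 * 1024)  # stand-in for a ~200 KB image
    image_id = add_image(conn, blob_dir, "cover", payload)
    return load_image(conn, image_id) == payload
```

The bookkeeping in add_image and load_image (and the cleanup of orphaned files, which this sketch omits) is the part that the checkbox takes off your hands.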

The filesystem and the APIs surrounding it are (just like a web server) optimized to serve files of any size and to apply caching where appropriate.
CoreData is optimized for handling an object graph with tiny pieces of data, like integers and short strings.
Also, there are a number of other issues that tend to creep up on you, like having to periodically vacuum the SQLite database CoreData uses; otherwise it can never shrink, only grow.
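The vacuuming point can be observed directly on a plain SQLite file. The sketch below (Python with sqlite3, dummy blobs, a hypothetical table name, and SQLite's default settings assumed) measures the file size after inserting blobs, after deleting them, and after running VACUUM:

```python
import os
import sqlite3
import tempfile


def sizes_after_delete_and_vacuum():
    """Return (size with data, size after DELETE, size after VACUUM) in bytes."""
    path = os.path.join(tempfile.mkdtemp(), "images.sqlite")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, data BLOB)")
    with conn:
        conn.executemany(
            "INSERT INTO images (data) VALUES (?)",
            [(bytes(100 * 1024),) for _ in range(50)],  # 50 dummy 100 KB blobs
        )
    full = os.path.getsize(path)

    with conn:
        conn.execute("DELETE FROM images")
    # With default settings the freed pages are kept inside the file for reuse.
    after_delete = os.path.getsize(path)

    conn.execute("VACUUM")  # rebuilds the database file without the free pages
    after_vacuum = os.path.getsize(path)
    conn.close()
    return full, after_delete, after_vacuum
```

Comparing the three numbers shows how much space a database that once held large blobs keeps occupying until it is vacuumed.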

Leonardo,
With Lion/iOS 5, Core Data started handling file system storage of large BLOBs for you.
The choice is really determined by how many images you are going to have open. If you have many, then you should keep them in the DB. Why? Because you only have a modest number of file descriptors, one of which is used for each open image stored in the file system.
That said, there is still a reason to manage the files yourself. If your BLOBs are really big, say 2+ MB, you will want to map them into memory and not just read them in. (When the memory warnings come, this lets the OS automatically purge them from your resident memory. This is a very good thing.) Even so, you still have the limited number of file descriptors problem.
Andrew

Use an SQL database as a word dictionary

I am creating a mobile game that takes words from users and then checks whether they are valid words in the English dictionary. I have created a similar game in the past using a dictionary that I loaded into the game's local memory.
The problem with that approach was that I would often need to update the dictionary with new words. Since the dictionary was in memory, adding new words required me to completely update the app. If I were to use an SQL database as the dictionary, I could add words very easily without having to update the app and have to rely on users to go and download the new update.
My question is, is there anything wrong with this approach (design or performance wise)? I have not seen something like this being done before. Also, I don't need definitions. I just need to make sure that the word is a valid English word.
If this is bad design, are there any better alternatives? Or am I better off just dealing with the in memory dictionary?

A SQL database seems overkill. Have you looked at a key-value store like Berkeley DB?

The answer depends to a large extent on the overhead of the database for your application. It may take a lot of processing power and memory for adding a small amount of functionality.
If you are already using a file based approach, perhaps the simplest solution is to periodically poll the file to check for updates (size or modify time). When one is found, load it into memory.
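A minimal sketch of that polling approach, in Python: the word list lives in an in-memory set, and refresh() reloads it only when the file's size or modification time has changed. The class name, file name and one-word-per-line format are assumptions made for this example.

```python
import os
import tempfile


class WordList:
    """In-memory word set that reloads itself when its file changes."""

    def __init__(self, path):
        self.path = path
        self._stamp = None
        self._words = frozenset()
        self.refresh()

    def refresh(self):
        """Reload if size or modification time changed. Returns True on reload."""
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp == self._stamp:
            return False
        with open(self.path, encoding="utf-8") as f:
            self._words = frozenset(line.strip().lower() for line in f if line.strip())
        self._stamp = stamp
        return True

    def __contains__(self, word):
        return word.lower() in self._words


def demo():
    path = os.path.join(tempfile.mkdtemp(), "words.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("rain\nrein\n")
    words = WordList(path)
    before = "ruin" in words
    with open(path, "a", encoding="utf-8") as f:
        f.write("ruin\n")  # simulate an updated dictionary file
    words.refresh()
    return before, "ruin" in words
```

In an app, refresh() would be called from a timer or before each validation round; lookups in between are plain set membership tests.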
The database would be valuable in an environment where the data is too big to fit in memory, because databases do a good job managing memory and disk space.

Which would be better? Storing/accessing data in a local text file, or in a database?

Basically, I'm still working on a puzzle-related website (micro-site really), and I'm making a tool that lets you input a word pattern (e.g. "r??n") and get all the matching words (in this case: rain, rein, ruin, etc.). Should I store the words in local text files (such as words5.txt, which would have a return-delimited list of 5-letter words), or in a database (such as the table Words5, which would again store 5-letter words)?
I'm looking at the problem in terms of data retrieval speeds and CPU server load. I could definitely try it both ways and record the times taken for several runs with both methods, but I'd rather hear it from people who might have had experience with this.
Which method is generally better overall?

The database will give you the best performance with the least amount of work. The built in index support and query analyzers will give you good performance for free while a textfile might give you excellent performance for a ton of work.
In the short term, I'd recommend creating a generic interface which would hide the difference between a database and a flat-file. Later on, you can benchmark which one will provide the best performance but I think the database will give you the best bang per hour of development.
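One possible shape for such an interface, sketched in Python: both back ends answer the same match() call for patterns like "r??n", so they can be swapped and benchmarked later. The class names and the single-column table are assumptions for this example. Note that SQLite's LIKE ignores ASCII case by default while the regular expression does not, so the sketch assumes lower-case words.

```python
import re
import sqlite3
from abc import ABC, abstractmethod


class WordSource(ABC):
    """Common interface so callers do not care where the words live."""

    @abstractmethod
    def match(self, pattern):
        """Return sorted words matching a pattern where '?' is any one letter."""


class FlatFileSource(WordSource):
    def __init__(self, lines):
        # `lines` is any iterable of words, e.g. an open words5.txt file.
        self._words = [w.strip() for w in lines if w.strip()]

    def match(self, pattern):
        regex = re.compile("".join("." if c == "?" else re.escape(c) for c in pattern))
        return sorted(w for w in self._words if regex.fullmatch(w))


class SqliteSource(WordSource):
    def __init__(self, words, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")
        with self._conn:
            self._conn.executemany(
                "INSERT OR IGNORE INTO words VALUES (?)", ((w,) for w in words)
            )

    def match(self, pattern):
        # LIKE uses '_' for one character; escape LIKE's own wildcards first.
        like = (
            pattern.replace("\\", "\\\\")
            .replace("%", "\\%")
            .replace("_", "\\_")
            .replace("?", "_")
        )
        rows = self._conn.execute(
            "SELECT word FROM words WHERE word LIKE ? ESCAPE '\\' ORDER BY word", (like,)
        )
        return [r[0] for r in rows]
```

Code that only depends on WordSource can be timed against either implementation with the same inputs, which is the benchmark the answer postpones.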

For fast retrieval you certainly want some kind of index. If you don't want to write index code yourself, it's certainly easiest to use a database.
If you are using Java or .NET for your app, consider looking into db4o. It just stores any object as is with a single line of code and there are no setup costs for creating tables.

Storing data in a local text file (when you append new records to the end of the file) is usually faster than storing it in a database. So, if you are building a high-load application, you can save the data in a text file and copy it to a database later. However, in most applications you should use a database instead of a text file, because the database approach has many benefits.
