I'm using VB.Net, and I have a set of data which I have to able to filter through fairly quickly. Basically, the program is like google sugest, but instead of a drop-down menu, I'm using a listbox. When a user enters a word, I compare the word using LINQ and filter those that contain the user's input. The data are all strings of variable length (from 0 to 200 characters, most on 150 character mark), and I have 240,000+ of this strings and counting- all stored in an XML file.
A colleague of mine told me that loading all of that to memory (using VB.Net's XML serializer plus collections of string/objects) is not practical, and would slow the 'startup' time of the program. I haven't finished building the program yet and I'm having second thoughts about continuing this path.
So, my question is: Should I continue with my current approach on the problem (which is load everything to memory on startup), or is there a better way of solving my dilemma?
If you want to prevent startup time and keeping it in memory isn't an issue on performance, then load it asynchronously. Although loading 240.000+ strings from an XML and keeping it in memory doesn't sound like the greatest idea. Probably a database would be the better approach. Or at least some format like JSON that's faster to parse.
Depends on a number of things:
If
((you know the strings will not hugely increase in number) &&
(you know the spec of the machines that will run your app) &&
(you are able to test that the load time is *good enough* on the above spec))
{
**don't bother changing approach.**
}
else
{
**change approach.**
}
The alternative approach is obviously some kind of asynch lazy-load.
You're talking about loading roughly 36MB of strings. While this isn't a daunting amount by any means (though you could probably load it faster reading the XML yourself...I wouldn't go with the serialization engine if I was worried about performance), it's also a non-trivial amount. You're looking a adding a couple of seconds to your startup time, assuming you don't do it asynchronously as Mircea suggests.
If you do do it asynchronously, you'll have to ensure that any UI process that relies on the data doesn't occur until after it has loaded. That may be a difficult thing to ensure.
The question seems to imply an online application. A few suggestions if that is the case:
The data could / should be zipped. I suspect it would compress very nicely.
Maybe the data could be cached accross multiple sessions, possibly be delivered as html content with a expiry cache date as appropriate. This would save systematic loading, and may be feasible if the data isn't updated frequently.
The suggestion feature feature could be initially disabled (i.e. say showing a "loading..." message while the application initializes the cache, asynchronously). In this fashion the application would be quickly available upon startup, even though the suggest feature may lag by up to say 30 seconds or so.
Edit: Independently of how the data gets downloaded and cached, I second the opinion of Mircea Grelus that an xml file of this size is a poor substitute for a database.
It may not be a bad idea to load the XML into memory when the app starts up. But if you go this route I'd look into using the BackgroundWorker thread. The idea would be to load the XML into memory asynchronously so the UI is still responsive as this is going on. As far as the user is concerned the app shouldn't appear to start any slower, and yet once done the Google-suggest-like feature should be significantly faster.
I must say that even in memory this is an inherently inefficient operation since you have no advantage of using an index when querying an XML file in this way. This is something that would be 10X faster in SQL with full-text searching.
Of course XML has the advantage of being self-contained and requiring no additional components. And that makes it a decent choice for small desktop apps that query small amounts of data. Otherwise I would consider using a database for better performance.
You might be better served by using binary serialization rather than XML serialization to persist the data that your app reads on startup, particularly if you end up implementing a data structure that's faster to search than a `StringCollection. You'd still maintain the XML version of the data somewhere, of course.
And by all means, use a BackgroundWorker to load the data asynchronously if that'll make your application feel more responsive.
Related
when one should use NSInMemoryStoreType?because it stores only in memory then how it will be useful for persistence?
Since it persists data in-memory, when you close the app, the data is lost for sure, and they can't be recovered.
But there is an advantage in using NSInMemoryType, and that is, speed. If you're going to save 100,000 data using NSInMemoryType, it's going to take 45 seconds to add them and getting the counts back using CountResultType.
Doing so using NSSQLiteStoreType actually takes more than 3-4 mins and thus it doesn't seem to be the best way to store all that data; unless post-application-relaunch persistence is a must.
A good use-case for In Memory would be an application which it's data changes every time the app launches or closes and there is more than 1,000 records to store. In that case, it's wise to persist the records in memory, since it's blazingly fast compared to NSSQLiteStoreType.
On the other hand, if the number of records to store is not much, there won't be any speed difference between SQLite and InMemory, and storing records using NSSQLiteMemoryType would be a much better choice.
I've compared the two and here's the results:
The in memory store type does not "persist" because it is not writing to a file.
I had a good use case recently: an app reads a complex XML file and translates it into a Core Data object graph. The code is very readable and can take advantage of useful objects such as NSFetchedResultsController.
The only real possible argument for using the Core Data SQLite store was for allowing faster startup (not necessary to read the XML file), but that was not a project requirement.
The argument for the in-memory store is that it is much faster than a database backed persisted store.
It's very useful if you already have a CoreData database with lots of code working with it, you want to extract some small amount of data, and use all your existing code with it, which means you need another CoreData database. Instead of having to create a new database in the file system with all the overhead involved, you can just create it in memory.
I understand what serialized is. I simply do not know when I would use it. I have seen the discouraged practice of session data in a database and things like that but other than that I do not know.
What kind of objects state would I save in a database, file system, anything that needs persistence? Why would I use it for a non-"permanent" reason?
I do not have a context per se. All I really do are client server web apps. I may get to use a Java stack for it, but I'd really like to understand this part of things, should I need it.
I have asked similar questions. I'm just not understanding.
In a sentence, using a generic serialiser is a reasonable way to save stuff to disk, move stuff over a network in a manner which doesn't require you to design a data format, write code that emits data in that format, and write a parser for that format (all error-prone) by hand.
Any time you want to persist an object (or object hierarchy) beyond its existence inside a single execution on a single machine, you are going to want to serialise and deserialise.
Some scenarios that come to my mind are
Caching: when you want to offload in-memory objects to disk (the caching framework can serialise the object to disk)
For thick clients (either a desktop application or an app using RMI) you'll need to transfer objects from one JVM to another, and this is done by serialising them
I can't think of any other scenarios from the top of my head.
I'm currently writing a web crawler (using the python framework scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind and, basically, stores links when they get scheduled, and marks them as 'processed' once they actually are.
Thus, I'm able to fetch those links (obviously there is a little bit more stored than just an URL, depth value, the domain the link belongs to, etc ...) when resuming the spider and so far everything works well.
Right now, I've just been using a mysql table to handle those storage action, mostly for fast prototyping.
Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean, using a very simple and light system, while still being able to handle a great amount of data written in short times
For now, it should be able to handle the crawling for a few dozen of domains, which means storing a few thousand links a second ...
Thanks in advance for suggestions
The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time costs for storage. Upon restarting, you re-read the log and rebuild the memory structures that you were also building on the fly as you were appending to the log in the first place.
Your specific application could be further optimized since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So, your log file can be buffered and doesn't need to be obsessively fsync'ed.
I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites you could probably just keep a set with all their URLs, no need for bloom filters or anything fancy) -- if it didn't, you might have to keep in memory only a set of recent entries, and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I'm not going into excruciating details about these options since it does not appear you will require them.
There was a talk at PyCon 2009 that you may find interesting, Precise state recovery and restart for data-analysis applications by Bill Gribble.
Another quick way to save your application state may be to use pickle to serialize your application state to disk.
I created this simple textpad program in WPF/VB.NET 2008 that automatically saves the content of the forms to an XML file on every keystroke.
Now, I'm trying to make the program see the changes on the XML file in realtime.. example, If I open two of my textpads, when I write on the first one, it will automatically reflect on the other textpad.
How can I do this?
One of my colleagues told me to read about iNotifyPropertyChanged (which I did) but how can I apply it to my application..?
:( help~
btw, I got the idea from a Google Wave demo, and I'm actually trying to do something bigger..
Note - this approach will be really, really expensive in terms of disk I/O, memory usage and CPU time. Why are you using XML is that the native format of the data you are editing? You may want to look at a more compact format - one that will use less memory, generate fewer I/Os and use less CPU.
Also note that you writer may need to flush the file for the watcher to notice any changes. This is expensive as well - especially if you re doing it on every key stroke.
Be sure to use the correct file open attributes (sharing, reading and writing).
You may want to consider using shared memory to communicate between your processes. This will be less expensive. You can avoid large ammounts of disk I/O by only writing changes to disk when the use asks to commit them, or there is a hint to do so. I suggest avoiding doing this on every key stroke.
Remember, your app needs to be a good system citizen and consume a reasonable amount of system resources. This is especially true running on netbooks and other 'low spec' systems.
You will probably need to use the FileSystemWatcher to watch the file on the disk rather than a property in the running instance of the application.
Or you could use some custom message passing between different instances of your application.
INotifyPropertyChanged isn't going to work for your application. That interfaced is used when data binding some element to a UI object.
Your best bet is going to be to attach a FileSystemWatcher to the file when you open it for editing. You can then use the change events to reload the file as needed in each instance of your application.
This will also load changes made from external editors.
It sounds like you are using file IO as a form of interprocess communication, if so, IMO you need to rethink your design, especially if you are doing something "bigger" than google wave (whatever bigger means in this context) as what you are proposing is terribly ineficient.
Do some searching on Interprocess communication and you will get a whole bunch of idea's #foredecker's idea (+1) of shared memory is a good possibility for example.
What am I trying to do?
A UI process that reads data from a Core Data store on disk. It wouldn't need to edit the data, just read and display the data.
A command line process that writes to the same data store as accessed by the UI.
Why?
So that the command line process can be running all the time but the user can quit the UI process and forget about the app until they need to look at the data it's captured.
What would be the simplest and most reliable way of achieving this?
What Have I Tried?
I've read up on sharing a data store between threads and implemented this once before, but I can't find anything in the docs or on the web indicating how to share a store between processes.
Is it as simple as pointing both processes at the same data store file? I've experimented with this briefly. It appeared to work OK, but I'm worried I might run into problems with locking etc when it's really put under stress.
Finally
I'd really appreciate someone giving me pointers on what direction to go with this. Thanks.
This might be one of those situations in which you'll simply have to Try It And Seeā¢.
Insofar as I can remember, SQLite (which is the data store you'll most likely want to be using) has built in mechanisms for file locking and so on; so the integrity of the file is likely to be assured. If, on the other hand, you use the CoreData/XML approach, you might run into problems.
In other words; use the SQLite backing for your file, and you should likely be fine.
You can do exactly what you want, you probably want to use the SQLite store otherwise saving and committing every time you want to synch out data will be horrifically slow. You just need to use some sort of IPC doorbell between the apps so that you can inform one app it needs to recheck the persistent store on disk and merge in its data.
Apple documents using multiple persistent store corindators as a valid option in Multi-Threading with Core Data (in "General Guidelines", open 2). That happens to be discussing completely parallel CD stacks in the same process, but it is valid if they are in completely separate address spaces as well.
Nearly two years on, and I've just found a much better way of doing this.
The answer seems to lie with Sync Services. I didn't even realise it existed! There's an excellent post about this at:
http://www.timisted.net/blog/archive/core-data-and-sync-services/
I've not tried this with my app yet, but it seems like an excellent way of sharing a core data store between two processes or applications.
If I experience any performance issues, I'll update this answer accordingly, but this seems like the Apple recommended way of doing it.
You need to re-think your architecture. If you want a daemon to own the data store, then have your GUI app connect to the daemon. Trying to share the data store is a can of worms you don't want to open.