dotMemory - finding objects with a short lifespan - dotmemory

How can I use dotMemory to identify all objects that were created and then collected either as of a snapshot or between two snapshots? It seems like it should be able to but I can't find anywhere that this is discussed (or I don't know the right terms to search with).

You need the memory traffic view. Note that memory traffic data can't be collected when dotMemory is attached to an already running application, due to a restriction in the Microsoft Profiling API.


Infinispan keySet() not suitable for production

I decided to use the Infinispan distributed grid to extend my application to support clustering, but I encountered a limitation when using this kind of shared resource.
How can I retrieve all the values or keys in the Distributed cache? I'm asking this because in their documentation all the collection methods are not recommended for running in production (meaning keySet()).
Right now I have a local bucket/cache with the key/value pairs, but in order to process the values I need to retrieve the keys and iterate through the set.
Set set = cache.keySet();
When having a large number of entries in the local cache, the keySet() returns a copy and this is a heavy load for the memory.
I tried to use the query feature but there are some network calls if I want to find the values and I don't need that. Also the query feature does not support complex filters.
Do you know which is the best approach when using infinispan in production?
As this is an experimental phase I'm using the latest Infinispan version.
Thanks a lot.
Map/Reduce functionality allows you to iterate over all the stored entries and also moves the logic to where the data is, so it doesn't add much of a burden.
We are using keySet() in production for informational purposes only. Performance does not seem to be a big issue under low data loads, but of course you should use such methods with great care because they could have a large performance impact depending on how you are using the cache. Remote cache queries seem a pretty handy feature to me.

Provide example for why it is not advisable to store images in CoreData?

This question has been asked many times, and I have read many users saying that it is not advisable to store images in a DB, in particular within CoreData. But they all seem to omit the reason why. Even the Apple documentation states this, everybody points in that direction, and every discussion ends like this: "well you can, but storing the path is better".
Apart from opinions, I would like to have a concrete example of why it is not a good solution.
To explain better: I have a strong background in building web applications. A concrete example I would give from my point of view could be: do not store images in a DB, but rather the path to them, because you can have them served by the web server, which can apply all of its caching mechanisms.
But in a desktop environment, and especially in an iOS application, what are the downsides of storing them in Core Data using SQLite, provided that:
There's a separate entity holding the images; it is not an attribute of the main entity.
There also seems to be a limit of 100 KB for images. Why? What happens with 110, 120 ... 200 KB, etc.?
There's nothing special about what Core Data normally does here. It's just using an SQLite database. You can put large blobs of data into it, but it just doesn't scale all that well. You can read more about it here: Internal Versus External BLOBs in SQLite.
That said, Core Data has support for external blobs which in Core Data terminology is called stored in external record (iOS 5.0 and later). Again, there's nothing magic about it, it's just storing the large pieces of data in the file system separately from the SQLite db itself. The benefit is that Core Data updates all this for you.
When you're in Xcode, there'll be a checkbox called Allows External Storage that you can check for Binary Data properties.
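To make the internal-versus-external distinction concrete outside of Core Data, here is a minimal Python sketch using the standard sqlite3 module. It is only an illustration of the idea: the table names, column names, and the save_internal/save_external/load_external helpers are all made up for this example, and Core Data's own external storage does this bookkeeping for you behind the scenes.

```python
import os
import sqlite3
import tempfile

def save_internal(db, name, data):
    # Internal BLOB: the image bytes live inside the database file itself.
    db.execute("INSERT INTO images_internal (name, data) VALUES (?, ?)", (name, data))

def save_external(db, blob_dir, name, data):
    # External BLOB: the bytes go to the file system, the row keeps only the path.
    path = os.path.join(blob_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    db.execute("INSERT INTO images_external (name, path) VALUES (?, ?)", (name, path))
    return path

def load_external(db, name):
    # Look up the path in the database, then read the bytes from the file system.
    row = db.execute("SELECT path FROM images_external WHERE name = ?", (name,)).fetchone()
    with open(row[0], "rb") as f:
        return f.read()

workdir = tempfile.mkdtemp()
db = sqlite3.connect(os.path.join(workdir, "store.sqlite"))
db.execute("CREATE TABLE images_internal (name TEXT PRIMARY KEY, data BLOB)")
db.execute("CREATE TABLE images_external (name TEXT PRIMARY KEY, path TEXT)")

fake_image = os.urandom(200 * 1024)  # stand-in for a 200 KB image
save_internal(db, "a.png", fake_image)
save_external(db, workdir, "b.png", fake_image)
db.commit()
```

With the external variant the database file stays small and the row is cheap to fetch; the cost is that you (or the framework) must keep the files and the rows in sync.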
The filesystem, and the APIs surrounding it, are (just like a web server) optimized to serve files of any size and to apply caching where appropriate.
CoreData is optimized for handling an object graph with tiny pieces of data, like integers and short strings.
Also, there are a number of other issues that tend to creep up on you, like periodically vacuuming the SQLite database CoreData uses, or it won't be able to shrink, just grow.
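The "grow but never shrink" behaviour can be seen with plain SQLite. The following is a small Python sketch using the standard sqlite3 module, assuming a default SQLite build where auto_vacuum is off; the table name and blob sizes are arbitrary. It does not go through Core Data at all, it only shows what the underlying file does.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "grow.sqlite")
db = sqlite3.connect(db_path)
db.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, data BLOB)")

# Insert twenty 100 KB blobs so the file grows to roughly 2 MB.
for _ in range(20):
    db.execute("INSERT INTO images (data) VALUES (?)", (os.urandom(100 * 1024),))
db.commit()
size_full = os.path.getsize(db_path)

# Delete everything: the pages are marked free but stay in the file
# (assuming auto_vacuum is off, which is the usual default).
db.execute("DELETE FROM images")
db.commit()
size_after_delete = os.path.getsize(db_path)

# VACUUM rebuilds the database file and releases the free pages.
db.execute("VACUUM")
size_after_vacuum = os.path.getsize(db_path)

print(size_full, size_after_delete, size_after_vacuum)
```

If the images were stored externally, deleting one would simply remove its file and no vacuuming would be needed for that space.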
With Lion/iOS 5, Core Data started handling file system storage of large BLOBs for you.
The choice is really determined by how many images you are going to have open. If you have many, then you should keep them in the DB. Why? Because you only have a modest number of file descriptors, one of which is used for each open image stored in the file system.
That said, there is still a reason to manage the files yourself. If your BLOBs are really big, say 2+ MB, you will want to map them into memory and not just read them in. (When the memory warnings come, this lets the OS automatically purge them from your resident memory. This is a very good thing.) Even so, you still have the limited number of file descriptors problem.
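As a rough illustration of mapping a large blob instead of reading it in, here is a Python sketch using the standard mmap module. On iOS you would use the platform's own facilities for this; the file name and the "HEADER" prefix here are invented for the example.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "big.blob")
payload_size = 2 * 1024 * 1024
with open(path, "wb") as f:
    f.write(b"HEADER" + os.urandom(payload_size))  # roughly 2 MB stand-in blob

with open(path, "rb") as f:
    # Map the file read-only. Pages are loaded lazily when touched, and
    # because they are clean the OS can drop them again under memory pressure.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        header = view[:6]        # touches only the start of the file
        total_size = len(view)   # no need to read the whole blob to know its size
```

Note that the mapping still needs an open file descriptor while it is created, which is the limit mentioned above.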

Erlang ETS tables versus message passing: Optimization concerns?

I'm coming into an existing (game) project whose server component is written entirely in Erlang. At times, it can be excruciating to get a piece of data from this system (say, how many widgets player 56 has) from the process that owns it. Assuming I can find the process that owns the data, I can pass a message to that process and wait for it to pass a message back, but this does not scale well to multiple machines and it kills response time.
I have been considering replacing many of the tasks that exist in this game with a system where information that is frequently accessed by multiple processes would be stored in a protected ets table. The table's owner would do nothing but receive update messages (the player has just spent five widgets) and update the table accordingly. It would catch all exceptions and simply go on to the next update message. Any process that wanted to know if the player had sufficient widgets to buy a fooble would need only to peek at the table. (Yes, I understand that a message might be in the buffer that reduces the number of widgets, but I have that issue under control.)
I'm afraid that my question is less of a question and more of a request for comments. I'll upvote anything that is both helpful and sufficiently explained or referenced.
What are the likely drawbacks of such an implementation? I'm interested in the details of lock contention that I am likely to see in having one-writer-multiple-readers, what sort of problems I'll have distributing this across multiple machines, and especially: input from people who've done this before.
First of all, the default ETS behaviour is consistent, as you can see in the documentation: Erlang ETS.
It provides atomicity and isolation, including for multiple updates/reads if they are done in the same function (remember that in Erlang a function call is roughly equivalent to a reduction, the unit of measure the Erlang scheduler uses to share time between processes, so an ETS operation spread over multiple functions could be split into several parts, creating a possible race condition).
If you are interested in a multi-node ETS architecture, maybe you should take a look at Mnesia, which gives you out-of-the-box multi-node concurrency on top of ETS: Mnesia.
(hint: I'm talking specifically of ram_copies tables, add_table_copy and change_config methods).
That being said, I don't understand the problem with a process (possibly backed by an unnamed ETS table).
Let me explain: the main problem with your project is the first, basic assumption.
It's simple: you don't have a single writing process!
Every time a player takes an object, hits a player and so on, it calls a function with side effects that updates the game state, so even if you have a single process managing game state, it must also tell the other player clients 'hey, you remember that object there? Just forget it!'; this is why the main problem with many multiplayer games is lag: when networking is not the main issue, lag is often due to blocking send/receive routines.
From this point of view, directly using an ETS table, a persistent table, a process dictionary (BAD!!!) and so on all amount to the same thing, because you have to consider synchronization issues, just as in object-oriented programming languages that use shared memory (Java, anyone?).
In the end, you should consider just ONE main concern when developing your application: consistency.
Only after a consistent application has been developed should you concern yourself with performance tuning.
Hope it helps!
Note: I've talked about something like a MMORPG server because I thought you were talking about something similar.
An ETS table would not solve your problems in that regard. Your code (that wants to get or set the player widget count) will always run in a process and the data must be copied there.
Whether that is from a process heap or an ETS table makes little difference, especially when getting the data from a remote node (that said, reading from ETS is often faster because it's well optimized and doesn't perform any work other than getting and setting data). For multiple readers ETS is most likely faster, since a process would handle the requests sequentially.
What would make a difference, however, is whether the data is cached on the local node or not. That's where self-replicating database systems, such as Mnesia, Riak or CouchDB, come in. Mnesia is in fact implemented using ETS tables.
As for locking, the latest version of Erlang comes with enhancements to ETS which enable multiple readers to simultaneously read from a table plus one writer that writes. The only locked element is the row being written to (thus better concurrent performance than a normal process, if you expect many simultaneous reads for one data point).
Note, however, that all interaction with ETS tables is non-transactional! That means that you cannot rely on writing a value based on a previous read, because the value might have changed in the meantime. Mnesia handles that using transactions. You can still use the dirty_* functions in Mnesia to squeeze near-ETS performance out of most operations, if you know what you're doing.
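The "write based on a previous read" problem is not specific to ETS, so it can be sketched in a few lines of Python. This is only an analogy: a plain dict stands in for the shared table, the interleaving is written out by hand, and the names are made up.

```python
# A plain dict stands in for a shared table; this is NOT ETS.
table = {"player_56": 10}

def read(key):
    return table[key]

def write(key, value):
    table[key] = value

# Two logical clients both read before either one writes.
seen_by_a = read("player_56")       # A sees 10
seen_by_b = read("player_56")       # B sees 10
write("player_56", seen_by_a - 5)   # A spends five widgets, table now holds 5
write("player_56", seen_by_b - 3)   # B spends three widgets, table now holds 7
lost_update_result = table["player_56"]  # A's update has been overwritten

# If each read-modify-write runs as one indivisible step (which is what a
# transaction, or a single owning process, gives you), both updates survive.
table["player_56"] = 10

def spend(key, amount):
    table[key] = table[key] - amount

spend("player_56", 5)
spend("player_56", 3)
serialized_result = table["player_56"]
```

In the first half the player ends up with 7 widgets after spending 8 out of 10; in the second half the result is the expected 2.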
It sounds like you have a bunch of things that can happen at any time, and you need to aggregate the data in a safe, uniform way. Take a look at the Generic Event behavior (gen_event). I'd recommend using this to create an event server, and have all these processes share this information via events to your server; at that point you can choose to log it or store it somewhere (like an ETS table). As an aside, ETS tables are not good for persistent data like how many "widgets" a player has. Consider Mnesia, or an excellent crash-only db like CouchDB. Both of these replicate very well across machines.
You bring up lock contention, but you shouldn't have any locks. Each process handles its messages sequentially, in the order they are received. In fact, the entire point of the message passing semantics built into the language is to avoid shared-state concurrency.
To summarize, normally you communicate with messages, from process to process. This is hairy for you, because you need information from processes scattered all over the place, so my recommendation is based on the idea of concentrating all information that is "interesting" outside of the originating processes into a single, real-time source.

Most optimized way to store crawler states?

I'm currently writing a web crawler (using the python framework scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind and, basically, stores links when they get scheduled, and marks them as 'processed' once they actually are.
Thus, I'm able to fetch those links when resuming the spider (obviously there is a little more stored than just a URL: a depth value, the domain the link belongs to, etc.), and so far everything works well.
Right now, I've just been using a MySQL table to handle those storage actions, mostly for fast prototyping.
Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean using a very simple and light system, while still being able to handle a large amount of data written in a short time.
For now, it should be able to handle the crawling of a few dozen domains, which means storing a few thousand links a second ...
Thanks in advance for suggestions
The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time costs for storage. Upon restarting, you re-read the log and rebuild the memory structures that you were also building on the fly as you were appending to the log in the first place.
Your specific application could be further optimized since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So, your log file can be buffered and doesn't need to be obsessively fsync'ed.
I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites you could probably just keep a set with all their URLs, no need for bloom filters or anything fancy) -- if it didn't, you might have to keep in memory only a set of recent entries, and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I'm not going into excruciating details about these options since it does not appear you will require them.
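To make the append-to-a-log idea concrete, here is a minimal Python sketch. It is one possible shape for such a log, not something scrapy provides: the CrawlLog class, the JSON-lines format, and the "scheduled"/"processed" event names are all assumptions made for this example.

```python
import json
import os
import tempfile

class CrawlLog:
    """Append-only log of crawl events, replayed on startup."""

    def __init__(self, path):
        self.path = path
        self.pending = {}   # url -> metadata for scheduled, unprocessed links
        self.done = set()   # urls that have already been processed
        if os.path.exists(path):
            self._replay()
        # Buffered append; deliberately no fsync, a lost tail is just re-crawled.
        self._file = open(path, "a", encoding="utf-8")

    def _replay(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    event = json.loads(line)
                except ValueError:
                    continue  # skip a torn line left by a crash
                self._apply(event)

    def _apply(self, event):
        url = event["url"]
        if event["type"] == "scheduled":
            if url not in self.done:
                self.pending[url] = event.get("meta", {})
        elif event["type"] == "processed":
            self.pending.pop(url, None)
            self.done.add(url)

    def _append(self, event):
        self._file.write(json.dumps(event) + "\n")
        self._apply(event)  # keep the in-memory structures in step with the log

    def schedule(self, url, **meta):
        self._append({"type": "scheduled", "url": url, "meta": meta})

    def mark_processed(self, url):
        self._append({"type": "processed", "url": url})

    def close(self):
        self._file.close()

log_path = os.path.join(tempfile.mkdtemp(), "crawl.log")
log = CrawlLog(log_path)
log.schedule("http://example.com/a", depth=1)
log.schedule("http://example.com/b", depth=2)
log.mark_processed("http://example.com/a")
log.close()

resumed = CrawlLog(log_path)  # rebuilds pending/done by re-reading the log
```

After the restart, resumed.pending holds only the link that was scheduled but never processed. Since the log only grows, you would want to compact it now and then, for example by writing a fresh log that contains just the current state.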
There was a talk at PyCon 2009 that you may find interesting, Precise state recovery and restart for data-analysis applications by Bill Gribble.
Another quick way to save your application state may be to use pickle to serialize your application state to disk.
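A minimal sketch of the pickle approach, assuming the crawler state is an ordinary Python structure (the dict layout and file name below are invented for the example). Writing to a temporary file and renaming it is an extra precaution, so that a crash in the middle of a dump does not destroy the previous snapshot.

```python
import os
import pickle
import tempfile

# Hypothetical crawler state: links still to crawl and links already done.
state = {
    "pending": {"http://example.com/b": {"depth": 2}},
    "done": {"http://example.com/a"},
}

state_path = os.path.join(tempfile.mkdtemp(), "crawler_state.pickle")
tmp_path = state_path + ".tmp"

# Dump to a temporary file first, then swap it into place.
with open(tmp_path, "wb") as f:
    pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
os.replace(tmp_path, state_path)

# On resume, load the snapshot back.
with open(state_path, "rb") as f:
    restored = pickle.load(f)
```

Unlike the append-only log, each snapshot rewrites the whole state, so this suits periodic checkpoints better than a few thousand writes a second. Only unpickle files you wrote yourself, since pickle can execute arbitrary code on load.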

Is there a way of sharing a Core Data store between processes?

What am I trying to do?
A UI process that reads data from a Core Data store on disk. It wouldn't need to edit the data, just read and display the data.
A command line process that writes to the same data store as accessed by the UI.
So that the command line process can be running all the time but the user can quit the UI process and forget about the app until they need to look at the data it's captured.
What would be the simplest and most reliable way of achieving this?
What Have I Tried?
I've read up on sharing a data store between threads and implemented this once before, but I can't find anything in the docs or on the web indicating how to share a store between processes.
Is it as simple as pointing both processes at the same data store file? I've experimented with this briefly. It appeared to work OK, but I'm worried I might run into problems with locking etc when it's really put under stress.
I'd really appreciate someone giving me pointers on what direction to go with this. Thanks.
This might be one of those situations in which you'll simply have to Try It And See™.
Insofar as I can remember, SQLite (which is the data store you'll most likely want to be using) has built in mechanisms for file locking and so on; so the integrity of the file is likely to be assured. If, on the other hand, you use the CoreData/XML approach, you might run into problems.
In other words; use the SQLite backing for your file, and you should likely be fine.
You can do exactly what you want. You will probably want to use the SQLite store, since otherwise saving and committing every time you want to sync out data will be horrifically slow. You just need to use some sort of IPC doorbell between the apps so that you can inform one app that it needs to recheck the persistent store on disk and merge in its data.
Apple documents using multiple persistent store coordinators as a valid option in Multi-Threading with Core Data (in "General Guidelines", option 2). That happens to be discussing completely parallel CD stacks in the same process, but it is valid if they are in completely separate address spaces as well.
Nearly two years on, and I've just found a much better way of doing this.
The answer seems to lie with Sync Services. I didn't even realise it existed! There's an excellent post about this at:
I've not tried this with my app yet, but it seems like an excellent way of sharing a core data store between two processes or applications.
If I experience any performance issues, I'll update this answer accordingly, but this seems like the Apple recommended way of doing it.
You need to re-think your architecture. If you want a daemon to own the data store, then have your GUI app connect to the daemon. Trying to share the data store is a can of worms you don't want to open.