How to explain the big difference between the memory and native stores' sizes on disk in Sesame 2.7.7?

I have a 2.9GB N-Triples file, and I managed to load it into both a native Sesame repository (with spoc, posc, and ospc indexes; let's call it repo_native) and an in-memory Sesame repository (let's call it repo_memory). I checked the on-disk size of both repositories in the directory ~/.aduna/openrdf-sesame/repositories and was surprised that the repo_native directory takes 1.8GB while the repo_memory directory takes only 125MB.
I don't really have any clue how to explain that. Does persistence for the in-memory repository somehow also use native storage?
Does someone have an explanation for such a difference in size?
Thanks in advance

There's not enough information here to diagnose the problem, but if you're uploading a 2.9GB file to a memory store and the size of the memory store's storage directory is only 125MB, that probably means your data has not been persisted to disk. You likely have not configured your in-memory repository to sync to disk, or something went wrong during the upload.
And no, the in-memory store does not use the native store's persistence mechanism; it has its own (far less sophisticated) one.
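If persistence is what you want, here is a minimal sketch of how a persistent, syncing memory store is set up with the Sesame 2.x API (the data directory path and the sync delay are illustrative placeholders, not values from the question):

    import java.io.File;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.memory.MemoryStore;

    public class PersistentMemoryRepo {
        public static void main(String[] args) throws Exception {
            File dataDir = new File("/path/to/repo_memory");  // placeholder path
            MemoryStore store = new MemoryStore(dataDir);     // passing a data dir makes the store persistent
            store.setSyncDelay(1000L);                        // batch disk syncs instead of syncing on every commit
            Repository repo = new SailRepository(store);
            repo.initialize();
            // ... upload data, then shut down cleanly so the final sync happens
            repo.shutDown();
        }
    }

Note that even with syncing working, the memory store serializes everything into a single compact data file, while the native store maintains separate B-tree files for each configured index, so the native directory will normally be larger; a gap this big for a 2.9GB input, though, suggests the data never made it to disk.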

Related

Gemfire for storing BLOB

We have a requirement where we want to store BLOB files in GemFire. The estimated size of this region would be in TBs because of the BLOB files.
We are planning to analyze two approaches. Please advise:
1) Create the GemFire region with an overflow configuration. This keeps only the keys in memory, with the actual data overflowed to a GemFire disk store.
This would also help control the GemFire region's size.
The size of the GemFire disk store, however, would be huge. Is this option feasible?
2) Store the files on server disks, with the file paths stored in a GemFire region.
Use the file paths to directly access/update the files from the client.
Is there any other suggested approach for such requirements?
To me, option 2 seems brittle. You would have to do a lot of work to provide HA and to handle disk failures, IO exceptions, etc. GemFire will take care of all of these things if you use option 1.
What is the size of the blobs you are looking to store? GemFire will not be able to store a single value that is more than 2GB. Handling terabytes of data in a region should not be a problem.
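If you go with option 1, a minimal sketch of an overflow region using the GemFire Java API might look like this (the region and disk-store names, the oplog size, and the 1000-entries-in-memory limit are illustrative assumptions to tune for your data):

    import com.gemstone.gemfire.cache.Cache;
    import com.gemstone.gemfire.cache.CacheFactory;
    import com.gemstone.gemfire.cache.EvictionAction;
    import com.gemstone.gemfire.cache.EvictionAttributes;
    import com.gemstone.gemfire.cache.Region;
    import com.gemstone.gemfire.cache.RegionShortcut;

    public class BlobRegionSetup {
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();

            // Disk store that receives the overflowed blob values.
            cache.createDiskStoreFactory()
                 .setMaxOplogSize(1024)               // 1GB oplog files; tune for TB-scale data
                 .create("blobDiskStore");

            // Keep at most 1000 entries in memory; overflow the rest to disk.
            Region<String, byte[]> blobs = cache
                .<String, byte[]>createRegionFactory(RegionShortcut.PARTITION_OVERFLOW)
                .setDiskStoreName("blobDiskStore")
                .setEvictionAttributes(EvictionAttributes.createLRUEntryAttributes(
                        1000, EvictionAction.OVERFLOW_TO_DISK))
                .create("blobRegion");

            blobs.put("file-1", new byte[1024]);      // each single value must stay under 2GB
        }
    }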

How to handle index files in a distributed Lucene cluster?

We are using Lucene in our application, and the index files are saved on the disk of the same server where the application runs.
The index files are almost 2GB at the moment, and they may be updated from time to time; for example, when new data is inserted into the database, we may have to rebuild that part of the index and add it.
So far so good, since there is only one application server; now we have to add another one to form a cluster, so I wonder how to handle the index files.
BTW, our application should be platform independent, since our clients use different OSes like Linux, and some of them even use cloud platforms with different storage like Amazon EFS or Azure Storage.
It seems I have two options:
1 Every server holds a copy of the index files and keeps them synchronized with the others.
But the synchronization mechanism would depend on the OS, which we are trying to avoid. And I am not sure whether it would cause conflicts if two servers update the index files with different documents at the same time.
2 Make the index files shared.
Like 1), the file-sharing mechanism is platform-dependent. Saving them to the database might be an alternative, but what about the performance? I have thought about using memcached to store them, but I have not found any examples.
How do you handle this kind of problem?
Possibly you should look into the Compass project. Compass allowed storing a Lucene index in a database or in distributed in-memory data grids like GigaSpaces, Coherence, and Terracotta. Unfortunately, that project is outdated and its last version was released in 2009. But you can try to adapt it for your purpose.
Another option is to look at HdfsDirectory, which supports storing an index in HDFS file systems. I see only 5 classes in the package org.apache.solr.store.hdfs, so it should be relatively easy to adapt them to store the index in in-memory caches like memcached or Redis.
Also, I found a project on GitHub for RedisDirectory, but it is at an initial stage and its last commit was in 2012. I can recommend it only as a reference.
Hope this helps you find the right solution.
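As an illustration of the HdfsDirectory route, here is a hedged sketch (the HDFS URL and the Lucene/Solr 4.x versions are assumptions; note that Lucene requires that only one IndexWriter owns an index at a time, so one node should do the writing even over shared storage):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;
    import org.apache.solr.store.hdfs.HdfsDirectory;

    public class SharedIndexWriter {
        public static void main(String[] args) throws Exception {
            // The index lives in HDFS, so every node in the cluster sees the same files.
            Configuration conf = new Configuration();
            Directory dir = new HdfsDirectory(
                    new Path("hdfs://namenode:8020/lucene/index"), conf);

            IndexWriterConfig iwc = new IndexWriterConfig(
                    Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
            try (IndexWriter writer = new IndexWriter(dir, iwc)) {
                Document doc = new Document();
                doc.add(new TextField("body", "hello shared index", Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }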

The meaning of evict() in infinispan cache

According to the docs for Infinispan (http://docs.jboss.org/infinispan/5.0/apidocs/), the evict() API does not remove the entry from any other caches in the cluster, only from the node it was invoked on.
If we are using "replication" mode, where the data is replicated across the caches, surely the data has to be consistent, and using the evict() API would make it inconsistent.
How then is the inconsistency resolved?
Thanks
evict() removes the entry only from memory on the node where you call it. It does not make the cache inconsistent, because if you call cache.get() and the entry is not found in memory, it is loaded from the cache store.
As the documentation states, the purpose is to inform the cache that you won't use the entry for some time, so the node can free some memory.
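A small sketch of that behavior with the Infinispan API (the infinispan.xml file is an assumed configuration that declares a cache backed by a cache store):

    import org.infinispan.Cache;
    import org.infinispan.manager.DefaultCacheManager;

    public class EvictDemo {
        public static void main(String[] args) throws Exception {
            // Assumed: infinispan.xml declares a cache store (e.g. a file store).
            DefaultCacheManager manager = new DefaultCacheManager("infinispan.xml");
            Cache<String, String> cache = manager.getCache();

            cache.put("k", "v");        // in memory, and written through to the cache store
            cache.evict("k");           // drops the in-memory copy on this node only
            String v = cache.get("k");  // memory miss -> transparently reloaded from the store
            System.out.println(v);      // prints "v": no inconsistency is observable

            manager.stop();
        }
    }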

Should I cache blob content to local HD?

Suppose I have files in blob storage, and these files are constantly used by my web application hosted in Windows Azure.
Should I perform some sort of caching of these blobs, like downloading them to my app's local hard drive?
Update: I was asked to provide a case to make it clear why I want to cache content, so here it goes: imagine I have an e-commerce website and my product images are all high resolution. Sometimes, though, I would like to serve them as thumbnails (e.g. for product listings), and one possible solution for that is to use an HTTP handler to resize the images on demand. I know I could use output caching so that each image only needs to be resized once, but for the sake of this example, let us just assume I would process the image every time it was requested. I imagine it would be faster to have the contents cached locally. In this case, would it be better to cache them on the HD or to use local storage?
Thanks in advance!
Just to start answering your question: yes, accessing static content from role-specific local storage will be faster compared to accessing it from Azure blob storage, due to network latency, even when the compute and the blob are in the same data center.
There could be a solution in which you download X number of blobs from Azure storage during a startup task (or a background task) into role-specific local storage and reference this static content via local storage. However, the real question is: for what reason do you want to cache the content from Azure blob storage? Is it for faster access or for reliability? If the reason is to have static content accessible almost immediately, then I could see having it cached in local storage.
There are pros and cons to each approach; if you can state specifically why you want to do this, you may get a much better, to-the-point response.
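To make the local-storage idea concrete, here is a minimal, SDK-agnostic sketch of the pattern (it assumes the blob is reachable through a plain URL; with the Azure SDK you would download through the blob client instead):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class LocalBlobCache {
        private final Path cacheDir;

        public LocalBlobCache(Path cacheDir) throws IOException {
            this.cacheDir = Files.createDirectories(cacheDir);
        }

        /** Returns a local copy of the blob, downloading it on first access. */
        public Path fetch(String blobName, URL blobUrl) throws IOException {
            Path local = cacheDir.resolve(blobName);
            if (!Files.exists(local)) { // cache miss: pull the blob down once
                try (InputStream in = blobUrl.openStream()) {
                    Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
                }
            }
            return local; // subsequent reads hit the local disk
        }
    }

An HTTP handler serving thumbnails would call fetch() and resize from the local file, paying the network cost only on the first request per instance.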
Why not use a local resource? It gives you a path to a folder on the HD, and you can get a lot of space. You can even keep it around between restarts.
Another option is Azure Cloud Drive. It's fast and would allow you to share the cache among instances (but only one instance can write at a time).
Erick

SQL Server 2005, Caches and all that jazz

Background to the question: I'm looking to implement a caching system for my website. Currently we're exploring memcached as a means of doing this. However, I am looking to see if something similar exists for SQL Server. I understand that MySQL has a query cache which, although not distributed, works as a sort of stop-gap measure. Is MySQL's query cache equivalent to the buffer cache in SQL Server?
So here are my questions:
Is there a way to know what is currently stored in the buffer cache?
As a follow-up to this, is there a way to force certain tables or result sets into the cache?
How much control do I have over what goes on in the buffer and procedure caches? I understand there used to be a DBCC PINTABLE command, but that has since been discontinued.
Slightly off topic: should caching even exist at the database layer? Or is it more prudent to manage caches using Velocity/memcached? If so, why? It seems like cache invalidation is something of a pain when handling many objects with overlapping triggers.
Thanks!
SQL Server implements a buffer pool the same way every database product under the sun has done (more or less) since System R showed the way. The gory details are explained in Transaction Processing: Concepts and Techniques. In addition, it has a caching framework used by the procedure cache, the permission token cache, and many other caching classes. This framework is best described in Clock Hands - what are they for.
But this is not the kind of caching applications are usually interested in. The internal database cache is perfect for scale-up scenarios, where a more powerful back-end database is able to respond faster to more queries by using these caches, but the modern application stack tends to scale out the web servers, and the real problem is caching query results in a cache used by the web farm. Ideally, this cache should be shared and distributed. Memcached and Velocity are examples of such application caching infrastructure. Memcached has a long history by now; its uses and shortcomings are understood, and there is significant know-how around how to use it, deploy it, manage it, and monitor it.
The biggest problem with caching in the application layer, and especially with distributed caching, is cache invalidation: how to detect the changes that occur in the back-end data and mark cached entries invalid so that new requests don't use stale data.
The simplest (for some definition of simple...) alternative is proactive invalidation from the application: the code knows when it changes an entity in the database, and after the change occurs it takes the extra step of marking the cached entries invalid (see the sketch after this list). This has several shortcomings:
It is difficult to know exactly which cached entries need to be invalidated. Dependencies can be quite complex; things are always more than just a simple table/entry; there are aggregate queries, joins, partitioned data, etc.
Code discipline is required to ensure all paths that modify data also invalidate the cache.
Changes to the data that occur outside the application's scope are not detected. In practice, there are always such changes: other applications using the same data, import/export and ETL jobs, manual intervention, etc.
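For illustration, here is a hedged sketch of proactive invalidation using the spymemcached client (the table name and cache keys are made up for the example):

    import java.net.InetSocketAddress;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import net.spy.memcached.MemcachedClient;

    public class ProductDao {
        private final Connection db;
        private final MemcachedClient cache;

        public ProductDao(Connection db) throws Exception {
            this.db = db;
            this.cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        }

        public void updatePrice(int productId, double price) throws Exception {
            try (PreparedStatement ps =
                     db.prepareStatement("UPDATE Products SET Price = ? WHERE Id = ?")) {
                ps.setDouble(1, price);
                ps.setInt(2, productId);
                ps.executeUpdate();
            }
            // Proactive invalidation: every code path that changes the row
            // must remember to drop every cache entry derived from it.
            cache.delete("product:" + productId);
            cache.delete("product-list:all"); // aggregate/list entries are easy to miss
        }
    }

The second delete() shows the first shortcoming in action: every aggregate or list entry derived from the row must be known to the writer and invalidated too.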
A more complicated alternative is a cache that is notified by the database itself when changes occur. Not many technologies exist to support this, though; it cannot work without active support from the database. SQL Server has Query Notifications for such scenarios; you can read more about it in The Mysterious Notification. Implementing QN-based caching in a standalone application is fairly complicated (and often done badly), but it works fine when implemented correctly. Doing so in a shared, scaled-out cache like memcached is quite a feat of strength, but it is doable.
Nai,
Answers to your questions follow:
From Wikipedia - always correct...? :-) For a more Microsoft answer, here is their description of the buffer cache.
Buffer management
SQL Server buffers pages in RAM to minimize disc I/O. Any 8 KB page can be buffered in-memory, and the set of all pages currently buffered is called the buffer cache. The amount of memory available to SQL Server decides how many pages will be cached in memory. The buffer cache is managed by the Buffer Manager. Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version. The page is updated on the disc by the Buffer Manager only if the in-memory cache has not been referenced for some time. While writing pages back to disc, asynchronous I/O is used, whereby the I/O operation is done in a background thread so that other operations do not have to wait for the I/O operation to complete. Each page is written along with its checksum. When reading the page back, its checksum is computed again and matched with the stored version to ensure the page has not been damaged or tampered with in the meantime.
For this answer, please refer to the above answer:
Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version.
You can query the bpool_commit_target and bpool_committed columns in the sys.dm_os_sys_info dynamic management view to return the number of pages reserved as the memory target and the number of pages currently committed in the buffer cache, respectively.
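From Java, for example, that query could look like the following (the connection string is an assumption; the same SELECT works from any SQL client):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class BufferPoolCheck {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://localhost;databaseName=master;integratedSecurity=true";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT bpool_commit_target, bpool_committed FROM sys.dm_os_sys_info")) {
                if (rs.next()) {
                    long target = rs.getLong("bpool_commit_target"); // pages reserved as the memory target
                    long committed = rs.getLong("bpool_committed");  // pages currently committed
                    System.out.printf("Buffer pool: %d of %d pages committed (8KB each)%n",
                            committed, target);
                }
            }
        }
    }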
I feel like Microsoft has had time to figure out caching for their product and should be trusted.
I hope this information was helpful,
Thanks!
Caching can take on many different meanings for an ASP.NET application, spread from the browser all the way to your hardware, with IIS, the application, and the database in the middle.
The caching you are talking about is database-level caching, which is mostly transparent to your application. This level of caching includes buffer pools, statement caches, etc. Make sure your DB server has plenty of RAM; in theory, a DB server should be able to load the entire DB store into memory. There is not much you can do at this level, other than pre-fetching some anticipated data when you start the application to ensure it is in the DB cache.
At the other end is the in-memory distributed caching system. Apart from memcached and Velocity, you can look at commercial solutions like NCache or Oracle Coherence; I have no experience with either of them, so I cannot recommend one. This level of caching promises scalability at a cheaper cost, since it is expensive to scale the DB tier by comparison. You may have to consider aspects like network bandwidth, though, and this type of caching, especially with invalidation and expiry, can be complicated.
You can cache at the web service tier using output caching at the IIS level (in IIS 7) and at the ASP.NET level.
At the application level you can use the ASP.NET cache. This is the one you can control the most, and it gives you good benefits.
Then there is caching going on at the client web proxy tier, which can be controlled by the Cache-Control HTTP header.
Finally, you have browser-level caching, plus view state and cookies for small data.
And don't forget that hardware like a SAN caches at the physical disk access level too.
In summary, caching can occur at many levels, and it is up to you to analyse and implement the best solution for your scenario. You have to find out the stability and volatility of your data, the expected load, etc. I believe caching at the ASP.NET level (especially for objects) gives you the most flexibility and control.
Your specific technical questions about SQL Server's buffer cache are going down the wrong path when it comes to "implementing a caching system for my website".
Sure, SQL Server will cache data to improve its own performance (and it does so rather well), but the point of implementing a caching layer on your web front-ends is to avoid having to talk to the database at all, because there is still overhead and resource contention even when your query is fulfilled entirely from SQL Server's cache.
What you want to be looking into is: memcached, Velocity, the ASP.NET Cache, the P&P Caching Application Block, etc.
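As a concrete illustration of that front-end caching layer, here is a minimal cache-aside sketch with the spymemcached client, the read-side counterpart to the invalidation sketch above (the key names and the TTL are arbitrary examples):

    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class CacheAside {
        private static final int TTL_SECONDS = 300;
        private final MemcachedClient cache;

        public CacheAside() throws Exception {
            cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        }

        /** Look in the cache first; hit the database only on a miss. */
        public String getProductName(int id) {
            String key = "product-name:" + id;
            String name = (String) cache.get(key);
            if (name == null) {              // miss: the only time we touch SQL Server
                name = loadFromDatabase(id); // e.g. a JDBC query, omitted here
                cache.set(key, TTL_SECONDS, name);
            }
            return name;
        }

        private String loadFromDatabase(int id) {
            return "product-" + id; // stand-in for a real query
        }
    }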