Azure storage - How does blob versioning affect listing operations? - azure-storage

From the Microsoft docs, it seems that too many versions will affect the speed of listing operations.
Enabling versioning for data that is frequently overwritten may result in increased storage capacity charges and increased latency during listing operations. To mitigate these concerns, store frequently overwritten data in a separate storage account with versioning disabled.
Will creating a new version do a list operation, or lock something?
What SumanthMarigowda-MSFT referred to may be the List Blobs operation:
include={snapshots,metadata,uncommittedblobs,copy,deleted,tags,versions,deletedwithversions,immutabilitypolicy,legalhold,permissions}
Optional. Specifies one or more datasets to include in the response:
versions: Version 2019-12-12 and newer. Specifies that versions of blobs should be included in the enumeration.

Creating a new version does not perform a listing operation, or lock anything. It's just that if you're making lots of updates to a blob, you're going to be creating lots of versions.
More versions means increased capacity charges. And more versions means longer listing times when you list blobs and include versions.
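As a rough illustration with the @azure/storage-blob SDK for TypeScript (version 12 or newer, which should expose an includeVersions listing option corresponding to include=versions above; the connection string and container name below are placeholders), every version becomes an extra entry the listing has to return:

import { BlobServiceClient } from "@azure/storage-blob";

const service = BlobServiceClient.fromConnectionString("<connection-string>");
const container = service.getContainerClient("<container-name>");

async function countListedEntries(): Promise<void> {
  const names: string[] = [];
  // With includeVersions, the service enumerates every version of every blob,
  // so a frequently overwritten blob shows up here once per version.
  for await (const blob of container.listBlobsFlat({ includeVersions: true })) {
    names.push(blob.name);
  }
  console.log(`Listing returned ${names.length} entries (current blobs plus their versions)`);
}

countListedEntries().catch(console.error);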

Related

KeyStore service with hard limits

I'm putting together a datastore technology that relies on a backing key-value store, and has TypeScript generator code to populate an index made of tiny JSON objects.
With AWS S3 in mind as an example store, I am alarmed at the possibility of a bug in my TypeScript index generation simply continuing to write endless entries at unlimited cost. AWS has no mechanism that I know of to defend me from bankruptcy, so I don't want to use their services.
However, for certain use cases the volume of data might build up to megabytes or gigabytes, and access might be only occasional, so cheap long-term storage would be great.
What simple key-value cloud stores equivalent to S3 are there that will allow me to define limits on storage and retrieval numbers or cost? In such a service, a bug would mean I eventually start seeing e.g. 403, 429 or 507 errors as feedback that I have hit a quota, limiting my financial exposure. Preferably this would be on a per-bucket basis, rather than the whole account being frozen.
I am using S3 as a reference for the bare minimum needed to fulfil the tool's requirements, but any similar blob or object storage API (put, get, delete, list in UTF-8 order) that eventually starts rejecting requests when a quota is exceeded would be fine too.
Learning the names of qualifying systems and the terminology for their quota limit feature would give me important insights and allow me to review possible options.
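To make the requirement concrete, here is a rough TypeScript sketch of the minimal store API I need, wrapped with the kind of quota behaviour I'm hoping a provider enforces server-side. Every name in it is made up for illustration; it is not any real service's API.

// Bare-minimum object store surface: put, get, delete, list (keys in UTF-8 order).
interface BlobStore {
  put(key: string, value: Uint8Array): Promise<void>;
  get(key: string): Promise<Uint8Array | null>;
  delete(key: string): Promise<void>;
  list(prefix: string): Promise<string[]>;
}

// The feedback I want once a budget is exhausted, instead of an ever-growing bill.
class QuotaExceededError extends Error {
  constructor(public readonly statusCode: 403 | 429 | 507) {
    super(`Quota exceeded (HTTP ${statusCode})`);
  }
}

// Client-side stand-in for what a per-bucket, provider-enforced quota would do:
// once maxBytes is reached, writes fail loudly and the index generator stops.
class QuotaLimitedStore implements BlobStore {
  private bytesUsed = 0;
  constructor(private readonly inner: BlobStore, private readonly maxBytes: number) {}

  async put(key: string, value: Uint8Array): Promise<void> {
    if (this.bytesUsed + value.byteLength > this.maxBytes) {
      throw new QuotaExceededError(507); // 507 Insufficient Storage
    }
    await this.inner.put(key, value);
    this.bytesUsed += value.byteLength; // deliberately simplified: deletes are not credited back
  }
  get(key: string): Promise<Uint8Array | null> { return this.inner.get(key); }
  delete(key: string): Promise<void> { return this.inner.delete(key); }
  list(prefix: string): Promise<string[]> { return this.inner.list(prefix); }
}

Of course I'd rather the provider reject the request itself than trust my own wrapper, which is exactly why I'm asking which services offer such quotas.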

Foundry Data Storage Optimization

Hi, I have a general question about pipeline optimization in order to lower storage usage.
Does deleting trashed datasets help alleviate disk storage? For example, removing obsolete datasets: a) based on business knowledge and utilization, and b) datasets in the trash.
Also, we'd like to manage the copies of datasets that are stored when a schedule runs. We believe that if we ever had to fall back to a previous version, we would only need to reference the latest one, as opposed to keeping multiple copies.
Does this affect storage? And is there a way to manage configuration on this?
Deleting trashed datasets (in typical setups) will result in their underlying files being deleted, but a larger driver of storage consumption is usually the set of previous dataset views kept by default.
You can control the length of time these files and views are kept using the Foundry Retention service. I'd recommend you consult with platform documentation and your support team for configuration of this service.
Retention will compute and mark files matching your configuration for deletion and periodically delete them, thus reducing your storage consumption.

Splitting Sensenet content repository into multiple databases

Is there a way of splitting a Content repository into multiple databases? There is a great chance I'll have TBs of data, maybe even tens of TBs. Maintaining a database bigger than 1 TB becomes an issue, so I can't imagine dealing with a bigger one. I've considered using Filestream, but having multiple databases would be a much more viable solution.
If not, is there at least a way of having several repositories contained in a single web site?
Currently (as of version 7.2) sensenet requires a central database to connect to; you cannot split that into multiple parts.
There is, however, the blob storage feature that lets you store binaries outside of the main metadata database. You choose a blob storage implementation (e.g. the MongoDB blob provider), install it, and you can start uploading files to sensenet. Binaries above a certain (configured) size will go to the external provider.
You'll have to take care of the backup of the blob storage yourself though, because that is different for every provider. At least the size of the metadata database will be significantly smaller.

What to do with old files of the SoftIndexFileStore in Infinispan persistent cache store?

I have a clustered cache store set up with Infinispan (8.2.4 Final) using the SoftIndexFileStore for persistence.
The documentation states that if entries expire, it's not possible for the Compactor to clean up the purged entries, and disk usage will grow over time. From the user guide:
When entries are stored with expiration, SIFS cannot detect that some
of those entries are expired. Therefore, such old file will not be
compacted (method AdvancedStore.purgeExpired() is not implemented).
This can lead to excessive file-system space usage.
Most of my entries expire, but there are some which need to persist indefinitely, meaning I can't simply run a cleanup job every once in a while to delete all the data files.
How should I deal with this wasted disk space? After several weeks of running I see many files which haven't been modified in weeks. Is it safe to delete old files which haven't been modified in, say, a month or more?
No; old files won't ever be modified again (they are written once and then considered immutable until removed). Removing them manually could lead to failures since these files are referenced in the index.
Regrettably, when the store is iterated and entries are found to be expired, Compactor.free() is not called, because there could be multiple concurrent iterations and we could end up calling it many times for a single entry.
A proper solution would be implementing a periodic (or JMX-triggered) process that goes through old files, computes the space occupied by expired entries, and schedules files that exceed some threshold for compaction. This should go into the Compactor. Please see the SIFS javadoc for a general design description.
If you're interested in developing this feature and you want to discuss that more, please go to Infinispan forum.

SQL Server 2005, Caches and all that jazz

Background to question: I'm looking to implement a caching system for my website. Currently we're exploring memcache as a means of doing this. However, I am looking to see if something similar exists for SQL Server. I understand that MySQL has a query cache which, although not distributed, works as a sort of 'stop-gap' measure. Is the MySQL query cache equivalent to the buffer cache in SQL Server?
So here are my questions:
Is there a way to know what is currently stored in the buffer cache?
As a follow-up to this, is there a way to force certain tables or result sets into the cache?
How much control do I have over what goes on in the buffer and procedure cache? I understand there used to be a DBCC PINTABLE command, but that has since been discontinued.
Slightly off topic: should the caching even exist at the database layer? Or is it more prudent to manage caches using Velocity/Memcache? If so, why? It seems like cache invalidation is something of a pain when handling many objects with overlapping triggers.
Thanks!
SQL Server implements a buffer pool the same way every database product under the sun has done (more or less) since System R showed the way. The gory details are explained in Transaction Processing: Concepts and Techniques. In addition it has a caching framework used by the procedure cache, the permission token cache and many, many other caching classes. This framework is best described in Clock Hands - what are they for.
But this is not the kind of caching applications are usually interested in. The internal database cache is perfect for scale-up scenarios where a more powerful back-end database is able to respond faster to more queries by using these caches, but the modern application stack tends to scale out the web servers, and the real problem is caching the results of queries in a cache used by the web farm. Ideally, this cache should be shared and distributed. Memcached and Velocity are examples of such application caching infrastructure. Memcached has a long history by now; its uses and shortcomings are understood, and there is significant know-how around how to use it, deploy it, manage it and monitor it.
The biggest problem with caching in the application layer, and specially with distributed caching, is cache invalidation. How to detect the changes that occur in the back end data and mark cached entries invalid so that new requests don't use stale data.
The simplest (for some definition of simple...) alternative is proactive invalidation from the application: the code knows when it changes an entity in the database, and after the change occurs it takes the extra step to mark the cached entries invalid (a minimal sketch follows this list). This has several shortcomings:
It is difficult to know exactly which cached entries need to be invalidated. Dependencies can be quite complex; things are always more than just a simple table/entry, and there are aggregate queries, joins, partitioned data, etc.
Code discipline is required to ensure all paths that modify data also invalidate the cache.
Changes to the data that occur outside the application scope are not detected. In practice, there are always changes that occur outside the application scope: other applications using the same data, import/export and ETL jobs, manual intervention, etc.
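Purely as an illustration of the proactive approach (the CacheClient interface, the SQL and the key names below are made up for this sketch, not a specific product's API), the write path has to both update the database and evict every cache entry it knows could be affected:

// Minimal shape of a distributed cache client (memcached, Velocity, etc.).
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  delete(key: string): Promise<void>;
}

// Minimal shape of the data access layer used by the application.
interface Database {
  execute(sql: string, params: unknown[]): Promise<void>;
}

async function updateCustomerName(
  db: Database,
  cache: CacheClient,
  customerId: number,
  newName: string
): Promise<void> {
  // 1. Apply the change to the system of record.
  await db.execute("UPDATE Customers SET Name = @p0 WHERE Id = @p1", [newName, customerId]);

  // 2. Proactively evict every cached entry this change could affect.
  //    This is the hard part: aggregates, joins and reports that include this
  //    customer must also be known here, and changes made outside this code
  //    path (ETL jobs, other applications) are never seen at all.
  await cache.delete(`customer:${customerId}`);
  await cache.delete("customers:recent-list");
}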
A more complicated alternative is a cache that is notified by the database itself when changes occur. Not many technologies are around to support this though, since it cannot work without active support from the database. SQL Server has Query Notifications for such scenarios; you can read more about it at The Mysterious Notification. Implementing QN-based caching in a standalone application is fairly complicated (and often done badly), but it works fine when implemented correctly. Doing so in a shared, scaled-out cache like Memcached is quite a feat of strength, but it is doable.
Nai,
Answers to your questions follow:
From Wiki - always correct...? :-). For a more Microsoft answer, here is their description of the buffer cache.
Buffer management
SQL Server buffers pages in RAM to minimize disc I/O. Any 8 KB page can be buffered in-memory, and the set of all pages currently buffered is called the buffer cache. The amount of memory available to SQL Server decides how many pages will be cached in memory. The buffer cache is managed by the Buffer Manager. Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version. The page is updated on the disc by the Buffer Manager only if the in-memory cache has not been referenced for some time. While writing pages back to disc, asynchronous I/O is used whereby the I/O operation is done in a background thread so that other operations do not have to wait for the I/O operation to complete. Each page is written along with its checksum when it is written. When reading the page back, its checksum is computed again and matched with the stored version to ensure the page has not been damaged or tampered with in the meantime.
For this answer, please refer to the above answer:
Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version.
You can query the bpool_commit_target and bpool_committed columns in the sys.dm_os_sys_info catalog view to return the number of pages reserved as the memory target and the number of pages currently committed in the buffer cache, respectively.
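If you want to read those two numbers programmatically, something along these lines should work with the Node.js mssql driver (the connection settings are placeholders; note that on SQL Server 2012 and later these columns were renamed committed_target_kb and committed_kb):

import * as sql from "mssql";

async function bufferPoolTargets(): Promise<void> {
  // Placeholder connection settings.
  const pool = await sql.connect({
    server: "localhost",
    database: "master",
    user: "<user>",
    password: "<password>",
    options: { trustServerCertificate: true },
  });
  const result = await pool
    .request()
    .query("SELECT bpool_commit_target, bpool_committed FROM sys.dm_os_sys_info");
  // bpool_commit_target: pages reserved as the memory target for the buffer cache.
  // bpool_committed: pages currently committed in the buffer cache.
  console.log(result.recordset[0]);
  await pool.close();
}

bufferPoolTargets().catch(console.error);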
I feel like Microsoft has had time to figure out caching for their product and should be trusted.
I hope this information was helpful,
Thanks!
Caching can take many different meanings for an ASP.NET application, spread from the browser all the way to your hardware, with IIS, the application and the database thrown in the middle.
The caching you are talking about is database-level caching; this is mostly transparent to your application. This level of caching includes buffer pools, statement caches, etc. Make sure your DB server has plenty of RAM; in theory a DB server should be able to load the entire DB store in memory. There is not much you can do at this level unless you pre-fetch some anticipated data when you start the application and ensure that it is in the DB cache.
On the other hand there are in-memory distributed caching systems. Apart from memcached and Velocity, you can look at some commercial solutions like NCache or Oracle Coherence; I have no experience with either of them to recommend one. This level of caching promises scalability at a cheaper cost; it is expensive to scale the DB tier compared to this. You may have to consider aspects like network bandwidth, though. This type of caching, especially with invalidation and expiry, can be complicated.
You can cache at the web service tier using output caching at the IIS level (in IIS 7) and the ASP.NET level.
At the application level you can use the ASP.NET cache. This is the one you can control the most, and it gives you good benefits.
Then there is caching going on at the client/web proxy tier, which can be controlled by the Cache-Control HTTP header.
Finally, you have browser-level caching, view state and cookies for small data.
And don't forget that hardware like a SAN caches at the physical disk access level too.
In summary, caching can occur at many levels, and it is up to you to analyse and implement the best solution for your scenario. You have to find out the stability and volatility of your data, the expected load, etc. I believe caching at the ASP.NET level (especially for objects) gives you the most flexibility and control.
Your specific technical questions about SQL Server's buffer cache are going down the wrong path when it comes to "implement a caching system for my website".
Sure, SQL Server is going to cache data so it can improve its performance (and it does so rather well), but the point of implementing a caching layer on your web front-ends is to avoid having to talk to the database at all, because there is still overhead and resource contention even when your query is fulfilled entirely from SQL Server's cache.
What you want to be looking into is: memcached, Velocity, ASP.NET Cache, P&P Caching Application Block, etc.
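As a concrete illustration of "avoid talking to the database at all", here is a minimal cache-aside sketch in TypeScript; CacheClient and loadProductFromSql are hypothetical stand-ins for whichever cache client (memcached, Velocity, ASP.NET Cache) and data access code you end up using:

// Minimal shape of a distributed cache client.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

interface Product {
  id: number;
  name: string;
  price: number;
}

async function getProduct(
  cache: CacheClient,
  loadProductFromSql: (id: number) => Promise<Product>,
  id: number
): Promise<Product> {
  const key = `product:${id}`;

  // Hot path: served entirely from the cache tier, no database round-trip.
  const cached = await cache.get(key);
  if (cached !== null) {
    return JSON.parse(cached) as Product;
  }

  // Cache miss: one trip to SQL Server, then populate the cache for later requests.
  const product = await loadProductFromSql(id);
  await cache.set(key, JSON.stringify(product), 300); // keep for 5 minutes
  return product;
}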