We are in the process of creating a piece of software to back up a storage account (blobs & tables, no queues), and while researching how to do this we came across the possibility of using Storage Logging. We would like to use this feature to do smart incremental backups after an initial full backup. However, in the introductory post for this feature here, the following caveat is mentioned:
During normal operation all requests are logged; but it is important to note that logging is provided on a best effort basis. This means we do not guarantee that every message will be logged due to the fact that the log data is buffered in memory at the storage front-ends before being written out, and if a role is restarted then its buffer of logs would be lost.
As this is a backup solution, this behavior makes the feature unusable for us; we can't afford to miss a file. However, I wonder if this has changed in the meantime, as Microsoft has built a number of features on top of it, like blob function triggers and, very recently, the new Azure Event Grid.
My question is whether this behavior has changed in the meantime, or whether the logs are still written on a best-effort basis and we should stick to our 'scanning' strategy?
The behavior for Azure Storage logs is still the same. For your case, you might be better off using the Event Grid notifications for Blob storage: https://azure.microsoft.com/en-us/blog/introducing-azure-event-grid-an-event-service-for-modern-applications/
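For illustration, here is a minimal sketch of what consuming those notifications could look like: an HTTP endpoint that Event Grid pushes Blob-created events to, written as a small Express app in TypeScript. The endpoint path, port, and the enqueueForBackup hook are assumptions; the subscription-validation handshake and the Microsoft.Storage.BlobCreated event type are part of the Event Grid contract.

```typescript
// Sketch of an Event Grid webhook for blob notifications (paths/ports are placeholders).
// npm install express
import express from "express";

const app = express();
app.use(express.json());

app.post("/api/blob-events", (req, res) => {
  const events = req.body as any[]; // Event Grid posts events in batches (arrays)

  for (const event of events) {
    // Event Grid first sends a validation event; echoing the code back proves endpoint ownership.
    if (event.eventType === "Microsoft.EventGrid.SubscriptionValidationEvent") {
      res.json({ validationResponse: event.data.validationCode });
      return;
    }

    // Every created/overwritten blob arrives as its own event carrying the blob URL.
    if (event.eventType === "Microsoft.Storage.BlobCreated") {
      console.log(`Blob to back up: ${event.data.url}`);
      // enqueueForBackup(event.data.url); // hypothetical hook into your incremental backup
    }
  }

  res.sendStatus(200);
});

app.listen(3000, () => console.log("Waiting for Event Grid notifications"));
```

Unlike the best-effort analytics logs, Event Grid retries delivery when an endpoint is temporarily unavailable, though you would probably still want an occasional full scan as a safety net.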
Use Case:
Say I wanted to create a realtime-collaborative document editing system.
In this scenario many users could create and collaborate on many documents.
Due to client-device constraints, it's not possible for any client to keep a replica of all documents, only a handful.
There needs to be a central storage server where all documents always live, and this server is always backed up.
Each client can 'subscribe' to any document, and all clients subscribed can see realtime changes of all other clients subscribed/editing the same document.
Questions:
Since each client can't store all documents, there needs to be a way to remove the replicas of 'old' documents from the client, without deleting the document from the central store, ideally based on an automatic least-recently-used approach. How is this handled in gun?
In gun, how can a document be deleted from the central store, so it's then effectively permanently removed from, and no longer accessible to, all clients?
When a document is deleted from the central store, how is the physical storage space ever actually reclaimed for later use?
Great questions, @user2672083. Here is the current layout:
Collaborative realtime document editing is possible with gun. Here is a quick prototype I recorded a long time ago; however, there are no full pre-built examples/implementations yet.
Not all data is stored on every client by default. The browser only keeps the data it requests/gets/subscribes to.
The default server already acts as a backup. I recommend using the S3 storage adapter, because then you do not have to worry about running out of disk space.
Removing old replicas. Currently, if I want the server to act as a central "master", I just put a localStorage.clear() at the top of my browser code. This forces the browser to always load the latest from the server. This is not ideal, though; an LRU-specific feature is coming soon, according to the roadmap.
Permanently removing data and reclaiming space. While this should be easy for a centralized setup, gun is P2P by default, so it uses a technique called tombstoning to delete data. Given a lot of requests (like yours) for LRU/TTL/GC/deleting, there will be better support for this in the future. Currently, you have to use a mix of rm data.json, localStorage.clear() and 30-day lifecycles on S3 to get this to work. This will be more integrated/easier in the future.
Now a question for you: What are you working on, and how can I help? Many of the things you asked about are possible (with some work) now, but slated to be the focus of the next version of gun - I'd love to get your feedback as we build this out.
All peers reply to requests for data (#2), meaning that localStorage and the server will both reply. Because localStorage is physically closer to the user, it will reply first/fastest, and then replies from the server will be merged. GUN does not try each peer "in sequence" with try/catch cascades; it gets replies from all peers in parallel.
GUN has swappable storage and transport interfaces, so yes, it is easy to build other layers on top or into it.
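Putting the pieces from the answers above together, here is a minimal TypeScript sketch of what subscribing to, editing, and tombstone-deleting a document could look like with gun. The peer URL, key names, and the commented-out localStorage.clear() are illustrative assumptions, not a prescribed setup.

```typescript
// Minimal gun sketch: subscribe to a document, publish edits, tombstone-delete it.
// npm install gun
import Gun from "gun";

// Connect to the central "master" peer; the browser also keeps its own local replica.
const gun = Gun({ peers: ["https://your-backup-server.example/gun"] });

// Optional: force this client to always pull the latest state from the server,
// as described above, by clearing the local replica on startup.
// localStorage.clear();

// Subscribe to one document; edits from any subscribed peer trigger this callback.
const doc = gun.get("documents").get("doc-123");
doc.get("text").on((text: string) => {
  console.log("latest text:", text);
});

// Publish an edit; gun merges it and broadcasts it to every subscribed peer.
doc.get("text").put("Hello from this client");

// "Delete" by tombstoning: nulling the node hides it from peers, but the physical
// space is only reclaimed via the manual cleanup steps described above.
doc.put(null);
```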
In my Node-based app I plan to use Redis for a number of purposes, basically interprocess pub/sub communication and an online-users cache. That part of the application is clear.
I am thinking about where to store the basic app's system configuration. Critical elements like main TCP port, default message channel, database name, admin-user password, etc. would go here.
The traditional choice for the implementation of this kind of thing would be a conf-file, maybe a JSON-structure in a plain text file.
I am wondering, however, if it would make sense to use Redis here as well. The major issue is data reliability and the risk of losing data.
What are the pros/cons?
The claims against Redis reliability are because of the async nature of its data persistence and replication mechanisms. Being async, these can't guarantee that all updates will endure failures - depending on how you configure Redis, there's a (small) chance that the most recent updates will be lost if there's a failure.
That said, in the context of configuration settings storage, this isn't really an issue. Configuration data is immutable most of the time, and if you, by some bizarre coincidence, lose your most recent updates to it, recovering the changes manually (i.e. reconfiguring) is usually trivial.
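If you do go the Redis route, a single hash is a natural fit for this kind of configuration. Below is a minimal sketch using the ioredis client from Node; the key name, fields, and a locally running Redis are assumptions. For extra durability you could also enable AOF persistence (appendonly yes) on the Redis server.

```typescript
// Sketch of keeping app configuration in a Redis hash (key name and fields are illustrative).
// npm install ioredis
import Redis from "ioredis";

const redis = new Redis(); // assumes a local Redis on the default port

async function writeConfig(): Promise<void> {
  // One hash holds the whole configuration.
  await redis.hset(
    "app:config",
    "tcpPort", "8080",
    "defaultChannel", "general",
    "databaseName", "appdb"
  );
}

async function readConfig(): Promise<Record<string, string>> {
  // HGETALL returns every field of the hash as a flat object of strings.
  return redis.hgetall("app:config");
}

async function main(): Promise<void> {
  await writeConfig();
  console.log(await readConfig());
  redis.disconnect();
}

main().catch(console.error);
```

One design note: the Redis connection details themselves still have to live somewhere outside Redis (an environment variable or a small config file), so a plain conf file often remains the simpler bootstrap for the truly critical settings.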
I've been trying to find ways to improve our NServiceBus code's performance. While searching I stumbled on the profiles that you can set when running/installing the NServiceBus host.
Currently we're running the NServiceBus host as-is, and I read that by default we are using the "Lite" profile. I've also learnt from this link:
http://docs.particular.net/nservicebus/hosting/nservicebus-host/profiles
that there are also Integration and Production profiles. The documentation does not say much - has anyone tried the Production profile and noticed an improvement in NServiceBus performance? Specifically, anything affecting the speed of consuming messages from the queues?
One major difference between the NSB profiles is how they handle storage of subscriptions.
The lite, integration and production profiles allow NSB to configure how reliable it is. For example, the lite profile uses in-memory subscription storage for all pub/sub registrations. This is a concern because in order to register a subscriber in the lite profile, the publisher has to already be running (so the publisher can store the subscriber list in memory). What this means is that if the publisher crashes for any reason (or is taken offline), all the subscription information is lost (until each subscriber is restarted).
So, the lite profile is good if you are running on a developer machine and want to quickly test how your services interact. However, it is just not suitable for other environments.
The integration profile stores subscription information on a local queue. This can be good for simple environments (like QA etc.). However, in a highly distributed environment holding the subscription information in a database is best, hence the production profile.
So, to answer your question, I don't think that by changing profiles you will see a performance gain. If anything, changing from the lite profile to one of the other profiles is likely to decrease performance (because you incur the cost of accessing queue or database storage).
Unless you have already tuned the logging yourself, we've seen large improvements from reducing logging. The performance of reading off the queues is the same across the profiles. Since the queues are local, you won't gain much from the transport. I would take a look at tuning your handlers and the underlying infrastructure. You may want to check out tuning MSMQ and look at the disks you are using, etc. Another spot to look at is how distributed transactions are behaving, assuming you are using a remote database that requires them.
Another option to increase throughput is to increase the number of threads consuming the queue. This will require a license. If a license is not an option, you can run multiple instances of a single-threaded endpoint, but this requires you to shard your work based on message type or something else.
Continuing up the scale you can then get into using the Distributor to load balance work. Again this will require a license, but you'll be able to add more nodes as necessary. All of the opportunities above also apply to this topology.
Background to the question: I'm looking to implement a caching system for my website. Currently we're exploring memcache as a means of doing this. However, I am looking to see if something similar exists for SQL Server. I understand that MySQL has a query cache which, although not distributed, works as a sort of 'stop gap' measure. Is the MySQL query cache equivalent to the buffer cache in SQL Server?
So here are my questions:
Is there a way to know what is currently stored in the buffer cache?
As a follow-up to this, is there a way to force certain tables or result sets into the cache?
How much control do I have over what goes on in the buffer and procedure cache? I understand there used to be a DBCC PINTABLE command but that has since been discontinued.
Slightly off topic: should caching even exist at the database layer? Or is it more prudent to manage caches using Velocity/Memcache? If so, why? It seems like cache invalidation is something of a pain when handling many objects with overlapping triggers.
Thanks!
SQL Server implements a buffer pool the same way every database product under the sun has (more or less) since System R showed the way. The gory details are explained in Transaction Processing: Concepts and Techniques. In addition, it has a caching framework used by the procedure cache, the permission token cache, and many, many other caching classes. This framework is best described in Clock Hands - what are they for.
But this is not the kind of caching applications are usually interested in. The internal database cache is perfect for scale-up scenarios, where a more powerful back-end database is able to respond faster to more queries by using these caches, but the modern application stack tends to scale out the web servers, and the real problem is caching query results in a cache used by the web farm. Ideally, this cache should be shared and distributed. Memcached and Velocity are examples of such application caching infrastructure. Memcached has a long history by now; its uses and shortcomings are understood, and there is significant know-how around how to use it, deploy it, manage it and monitor it.
The biggest problem with caching in the application layer, and especially with distributed caching, is cache invalidation: how to detect the changes that occur in the back-end data and mark cached entries invalid so that new requests don't use stale data.
The simplest (for some definition of simple...) alternative is proactive invalidation from the application. The code knows when it changes an entity in the database, and after the change occurs it takes the extra step of marking the cached entries invalid (see the sketch after this list). This has several shortcomings:
It is difficult to know exactly which cached entries need to be invalidated. Dependencies can be quite complex; things are always more than just a simple table/entry, there are aggregate queries, joins, partitioned data, etc.
Code discipline is required to ensure all paths that modify data also invalidate the cache.
Changes to the data that occur outside the application scope are not detected. In practice, there are always changes that occur outside the application scope: other applications using the same data, import/export and ETL jobs, manual intervention, etc.
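To make the pattern concrete, here is a minimal TypeScript sketch of proactive invalidation: the write path updates the database and then explicitly drops the cache keys it knows about. The Map-backed cache and the fake executeSql are stand-ins so the snippet runs on its own; key names and the SQL are illustrative.

```typescript
// Minimal sketch of application-driven ("proactive") invalidation. The in-memory Map
// stands in for memcached/Velocity; the "database" call is faked so the file runs.
interface Cache {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
  del(key: string): void;
}

const cache: Cache = (() => {
  const store = new Map<string, string>();
  return {
    get: (k: string) => store.get(k),
    set: (k: string, v: string) => { store.set(k, v); },
    del: (k: string) => { store.delete(k); },
  };
})();

// Stand-in for the real data store.
async function executeSql(sql: string, params: unknown[]): Promise<void> {
  console.log("SQL:", sql, params);
}

async function updateProductPrice(productId: number, newPrice: number): Promise<void> {
  // 1. Apply the change to the system of record first.
  await executeSql("UPDATE Products SET Price = @price WHERE Id = @id", [newPrice, productId]);

  // 2. Explicitly drop every cached entry known to depend on this row.
  //    This step is exactly the weak point listed above: aggregates, joins and
  //    writes made by other applications are easy to miss.
  cache.del(`product:${productId}`);
  cache.del("products:top-sellers"); // an aggregate that also reflects prices
}

cache.set("product:42", JSON.stringify({ id: 42, price: 17.99 }));
updateProductPrice(42, 19.99).catch(console.error);
```

The hard part is precisely what the list above describes: keeping that invalidation list complete as queries and writers multiply.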
A more complicated alternative is a cache that is notified by the database itself when changes occur. Not many technologies are around to support this, though; it cannot work without active support from the database. SQL Server has Query Notifications for such scenarios; you can read more about it at The Mysterious Notification. Implementing QN-based caching in a standalone application is fairly complicated (and often done badly), but it works fine when implemented correctly. Doing so in a shared, scaled-out cache like Memcached is quite a feat of strength, but it is doable.
Nai,
Answers to your questions follow:
From Wiki - always correct...? :-). For a more Microsoft answer, here is their description of the buffer cache.
Buffer management
SQL Server buffers pages in RAM to minimize disc I/O. Any 8 KB page can be buffered in-memory, and the set of all pages currently buffered is called the buffer cache. The amount of memory available to SQL Server decides how many pages will be cached in memory. The buffer cache is managed by the Buffer Manager. Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version. The page is updated on the disc by the Buffer Manager only if the in-memory cache has not been referenced for some time. While writing pages back to disc, asynchronous I/O is used whereby the I/O operation is done in a background thread so that other operations do not have to wait for the I/O operation to complete. Each page is written along with its checksum when it is written. When reading the page back, its checksum is computed again and matched with the stored version to ensure the page has not been damaged or tampered with in the meantime.
For this answer, please refer to the passage quoted above:
Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version.
You can query the bpool_commit_target and bpool_committed columns in the sys.dm_os_sys_info dynamic management view to return the number of pages reserved as the memory target and the number of pages currently committed in the buffer cache, respectively.
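To answer the first question more directly (what is currently in the buffer cache), the sys.dm_os_buffer_descriptors DMV exposes one row per buffered page. Below is a sketch that breaks the cache down per database, wrapped in a small Node script using the mssql package so it matches the other snippets in this thread; the connection settings are placeholders.

```typescript
// Sketch: inspect what is sitting in the buffer cache, per database.
// npm install mssql
import sql from "mssql";

const query = `
  SELECT DB_NAME(bd.database_id) AS database_name,
         COUNT(*)                AS cached_pages,
         COUNT(*) * 8 / 1024     AS cached_mb     -- each buffered page is 8 KB
  FROM sys.dm_os_buffer_descriptors AS bd
  GROUP BY bd.database_id
  ORDER BY cached_pages DESC;`;

async function main(): Promise<void> {
  const pool = await sql.connect({
    server: "localhost",
    database: "master",
    user: "sa",
    password: "<your password>",
    options: { trustServerCertificate: true },
  });

  const result = await pool.request().query(query);
  console.table(result.recordset);

  await pool.close();
}

main().catch(console.error);
```

As for forcing tables into the cache: since DBCC PINTABLE was removed there is no supported way to pin pages; the practical approach is generally just to query the data you want kept warm.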
I feel like Microsoft has had time to figure out caching for their product and should be trusted.
I hope this information was helpful,
Thanks!
Caching can take on many different meanings for an ASP.NET application, spread from the browser all the way to your hardware, with IIS, the application, and the database thrown in the middle.
The caching you are talking about is database-level caching, which is mostly transparent to your application. This level of caching includes buffer pools, statement caches, etc. Make sure your DB server has plenty of RAM; in theory a DB server should be able to hold the entire DB in memory. There is not much you can do at this level other than pre-fetching some anticipated data when you start the application to ensure that it is in the DB cache.
On the other hand, there are in-memory distributed caching systems. Apart from memcache and Velocity, you can look at commercial solutions like NCache or Oracle Coherence; I have no experience with either to recommend them. This level of caching promises scalability at a lower cost: it is expensive to scale the DB tier compared to this. You may have to consider aspects like network bandwidth, though, and this type of caching, especially with invalidation and expiry, can be complicated.
You can cache at the web service tier using output caching at the IIS level (in IIS 7) and at the ASP.NET level.
At the application level you can use the ASP.NET cache. This is the one you can control the most, and it gives you good benefits.
Then there is caching going on at the client/web-proxy tier, which can be controlled by the Cache-Control HTTP header.
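As a small illustration of that last level, this is what setting the header looks like in a hand-rolled handler (shown with Express/TypeScript purely for brevity; ASP.NET's output caching and Response.Cache set the same header for you).

```typescript
// Sketch: let browsers and shared proxies reuse a response for five minutes.
import express from "express";

const app = express();

app.get("/products", (_req, res) => {
  res.set("Cache-Control", "public, max-age=300"); // cacheable by clients and proxies for 300 s
  res.json([{ id: 1, name: "Widget" }]);
});

app.listen(3000);
```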
Finally you have browser level caching, view state and cookies for small data.
And don't forget that hardware like SAN caches at physical disk access level too.
In summary, caching can occur at many levels, and it is up to you to analyse and implement the best solution for your scenario. You have to find out the stability and volatility of your data, the expected load, etc. I believe caching at the ASP.NET level (especially for objects) gives you the most flexibility and control.
Your specific technical questions about SQL Server's buffer cache are going down the wrong path when it comes to "implement a caching system for my website".
Sure, SQL Server is going to cache data so it can improve its performance (and it does so rather well), but the point of implementing a caching layer on your web front-ends is to avoid having to talk to the database at all - because there is still overhead and resource contention even when your query is fulfilled entirely from SQL Server's cache.
What you want to be looking into is: memcached, Velocity, the ASP.NET Cache, the P&P Caching Application Block, etc.
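The usual shape of that layer is cache-aside: answer from the cache when you can, and only fall through to SQL Server on a miss. A minimal TypeScript sketch follows; the Map with a TTL stands in for memcached/Velocity, and queryProductsFromSql is a hypothetical data-access call.

```typescript
// Minimal cache-aside sketch: the web tier answers from its cache when it can and only
// falls through to the database on a miss or after the entry has expired.
const cache = new Map<string, { value: unknown; expires: number }>();

async function getCached<T>(
  key: string,
  ttlMs: number,
  load: () => Promise<T> // the expensive database query
): Promise<T> {
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) {
    return hit.value as T; // served without touching the database at all
  }

  const value = await load();
  cache.set(key, { value, expires: Date.now() + ttlMs });
  return value;
}

// Usage: cache the product list for 60 seconds.
// const products = await getCached("products:all", 60_000, () => queryProductsFromSql());
```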
I've worked in shops where I've implemented exception logging to the event log, and in others to a table in the database.
Each has its merits, of which I can highlight a few based on my experience:
Event Log
Industry standard location for exceptions (+)
Ease of logging (+)
Can log database connection problems here (+)
Can build report and viewing apps on top of the event log (+)
Needs to be flushed every so often, if a lot is reported there (-)
Not as extensible as SQL logging [add custom fields like method name in SQL] (-)
SQL/Database
Can handle large volumes of data (+)
Can handle rapid volume inserts of exceptions (+)
Single storage location for exceptions in a load-balanced environment (+)
Very customizable (+)
A little easier to build reporting/notification off of SQL storage (+)
Different from where typical exceptions are stored (-)
Am I missing any major considerations?
I'm sure that a few of these points are debatable, but I'm curious what has worked best for other teams, and why you feel strongly about the choice.
You need to differentiate between logging and tracing. While the lines are a bit fuzzy, I tend to think of logging as "non developer stuff". Things like unhandled exceptions, corrupt files, etc. These are definitely not normal, and should be a very infrequent problem.
Tracing is what a developer is interested in. The stack traces, method parameters, that the web server returned an HTTP Status of 401.3, etc. These are really noisy, and can produce a lot of data in a short amount of time. Normally we have different levels of tracing, to cut back the noise.
For logging in a client app, I think that Event Logs are the way to go (I'd have to double check, but I think ASP.NET Health Monitoring can write to the Event Log as well). Normal users have permission to write to the event log, as long as you have the setup (which is run by an admin anyway) create the event source.
Most of your advantages for SQL logging, while true, aren't applicable to event logging:
Can handle large volumes of data: Do you really have large volumes of unhandled exceptions or other high-level failures?
Can handle rapid volume inserts of exceptions: A single unhandled exception should bring your app down - it's inherently rate limited. Other interesting events to non developers should be similarly aggregated.
Very customizable: The message in an Event Log is pretty much free text. If you need more info, just point to a text or structured XML or binary file log
A little easier to build reporting/notification off of SQL storage: Reporting is built in with the Event Log Viewer, and notification systems are, either inherent - due to an application crash - or mixed in with other really critical notifications - there's little excuse for missing an Event Log message. For corporate or other networked apps, there's a thousand and 1 different apps that already cull from Event Logs for errors...chances are your sysadmin is already using one.
For tracing, of which the specific details of an exception or error are a part, I like flat files - they're easy to maintain, easy to grep, and can be imported into SQL for analysis if I like.
90% of the time, you don't need them and they're set to WARN or ERROR. But, when you do set them to INFO or DEBUG, you'll generate a ton of data. An RDBMS has a lot of overhead - for performance (ACID, concurrency, etc.), storage (transaction logs, SCSI RAID-5 drives, etc.), and administration (backups, server maintenance, etc.) - all of which are unnecessary for trace logs.
I wouldn't log straight to the database. As you say, database issues become tricky to log :)
I would log to the filesystem, and then have a job which bulk-inserts from files to the database. Personally I like having the logs in the database in the long run, primarily for the scaling situation - I pretty much assume I'll have more than one machine running, and it's handy to be able to effectively have a combined log. (Each entry should state the machine it comes from, of course.)
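A minimal sketch of that arrangement (file name, columns, and the stubbed-out insert are all illustrative): each server appends one JSON line per entry, stamped with its machine name, and a scheduled job later loads the accumulated lines into a Logs table.

```typescript
// Sketch of the "log to the filesystem, bulk-load into the database later" approach.
import { appendFileSync, readFileSync } from "fs";
import { hostname } from "os";

const LOG_FILE = "app-log.jsonl";

// Each entry is one JSON line, stamped with the machine it came from (important when
// several servers feed the same combined log table).
function log(level: string, message: string): void {
  const entry = { at: new Date().toISOString(), machine: hostname(), level, message };
  appendFileSync(LOG_FILE, JSON.stringify(entry) + "\n");
}

// A scheduled job would read the accumulated lines and bulk-insert them into a Logs table.
function loadIntoDatabase(): void {
  const lines = readFileSync(LOG_FILE, "utf8").split("\n").filter(Boolean);
  const rows = lines.map((line) => JSON.parse(line));
  // e.g. INSERT INTO Logs (LoggedAt, Machine, Level, Message) VALUES ... in batches;
  // the database is never on the critical path of the request that produced the entry.
  console.log(`Would insert ${rows.length} rows into the Logs table`);
}

log("ERROR", "Payment gateway timed out");
loadIntoDatabase();
```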
Report and viewing apps can be done very easily from a database - there may be fewer log-specialized reporting tools out there at the moment, but pretty much all databases have generalised reporting functionality.
For ease of logging, I'd use a framework like log4net which takes a lot of the effort out of it, and is a tried and tested solution. Aside from anything else, that means you can change your output strategy with no code changes. You could even log to both the event log and the database if necessary, or send some logs to one place and some to the other. (I've assumed .NET here, but there are similar logging frameworks for many platforms.)
One thing that needs considering about event logging is that there are products out there which can monitor your servers' event logs (like Microsoft Operations Manager) and intelligently do notification, and gather statistics on their contents.
A "minus" of SQL-based logging is that it adds another layer of dependencies to your application, which may or may not always be acceptable. I've done both in my career. I once or twice even used a MSMQ based message queue to queue log events and empty the queue into a MSSQL database to eliminate the need for my client software to have a connection to the DB.
One note about writing to the event log: that requires certain permissions for your application users that in some environments may be restricted by default.
Where I'm at, we do most of our logging to a database, with flat files as backup. It's pretty nice; we can do things like get an RSS feed for an app to watch for a few days when we make a change.