How do I know when to invalidate the cache, if a table change is made from an outside source?
I have an API call that returns an employee table. The first time this call is made, I cache the results so that subsequent calls pull the data from the cache instead of the database. This makes sense; however, what happens if someone adds a new record to the employee table from outside of the API? How does the cache know that it is now invalid?
If the user made the change to the employee table through the API I can capture that, but we have a separate desktop app that doesn't use the API, and that app can make changes directly to the employee table. Are there any accepted standards for handling this?
The only possible solution I can think of is to add a trigger to the employee table and somehow use that to know when the table has changed. But we have over a thousand tables, and we make an API call for each table, so I do not think that adding a thousand triggers to our database is an acceptable solution.
Yes, you could add a trigger as suggested. Or you could use a caching system that supports an expiry time / sliding expiration, accepting that you would be serving up stale data some of the time, but not always.
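For reference, here's a minimal sketch of the expiry approach using System.Runtime.Caching.MemoryCache; the Employee type, the cache key, and the five-minute window are illustrative assumptions, not anything from the question.

```csharp
// Minimal sketch of expiry-based caching; the key name and the 5-minute sliding
// window are arbitrary choices for illustration.
using System;
using System.Collections.Generic;
using System.Runtime.Caching;

public class Employee { public int Id; public string Name; }

public static class EmployeeCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static IList<Employee> GetEmployees(Func<IList<Employee>> loadFromDatabase)
    {
        var cached = Cache.Get("employees") as IList<Employee>;
        if (cached != null)
            return cached;                            // served from cache (possibly slightly stale)

        var employees = loadFromDatabase();           // cache miss: hit the database
        Cache.Set("employees", employees, new CacheItemPolicy
        {
            SlidingExpiration = TimeSpan.FromMinutes(5)   // entry expires 5 minutes after last access
        });
        return employees;
    }
}
```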
As the other answer suggests, your trigger idea is OK; however, as you've stated, that would be a lot of triggers.
If your cache is not local to the API (which I assume it isn't, if triggers would be able to reach it), could you not access it from your desktop application as well? You could invalidate the cache by having the desktop application remove the employee entry from the cache whenever it makes a successful change to the employee table.
It boils down to this:
You have a cache (which is essentially a read store).
You have two options to update it:
- Either it times out and re-fetches (which is OK if you don't need up-to-the-minute, real-time data)
- Or it has to be told its data is no longer valid.
There are two ways to solve this:
- Pull model: a database trigger on the SQL Server table populates an intermediate audit table, and a background task polls that audit table (a sketch of this follows below).
- Push model: a CLR trigger pushes the update out to an API. Whenever DML happens, the CLR trigger calls the API, which in turn can invalidate or refresh the cache!
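A minimal sketch of the pull/polling side, assuming a hypothetical audit table dbo.CacheInvalidation(TableName, ChangedAtUtc) that the triggers write into, and an invalidation callback supplied by the caller:

```csharp
// Sketch only: the CacheInvalidation table, its columns, and the invalidateCache
// callback are assumptions for illustration, not part of the original answer.
using System;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

public class CacheInvalidationPoller
{
    private readonly string _connectionString;
    private readonly Action<string> _invalidateCache;   // e.g. tableName => cache.Remove(tableName)
    private DateTime _lastCheckUtc = DateTime.UtcNow;

    public CacheInvalidationPoller(string connectionString, Action<string> invalidateCache)
    {
        _connectionString = connectionString;
        _invalidateCache = invalidateCache;
    }

    public async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var since = _lastCheckUtc;
            _lastCheckUtc = DateTime.UtcNow;            // capture before querying so nothing is missed

            using (var connection = new SqlConnection(_connectionString))
            using (var command = new SqlCommand(
                "SELECT DISTINCT TableName FROM dbo.CacheInvalidation WHERE ChangedAtUtc > @since",
                connection))
            {
                command.Parameters.AddWithValue("@since", since);
                await connection.OpenAsync(token);
                using (var reader = await command.ExecuteReaderAsync(token))
                {
                    while (await reader.ReadAsync(token))
                        _invalidateCache(reader.GetString(0));   // drop the stale entry for that table
                }
            }

            await Task.Delay(TimeSpan.FromSeconds(5), token);    // poll interval is arbitrary
        }
    }
}
```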
Hope this helps!
I use the podio API to create an item. In the form I have a few calculations. When I retrieve the item immediately after its creation, using the api, the fields are not calculated yet. The calculation is asynchronous, so that makes sense.
When I use a create hook and fetch the item based on the hook, the calculated fields are there.
Does anybody know if I can depend on this, i.e. is the create hook fired after the fields are calculated?
Yes, the JavaScript calculations are asynchronous.
Also related: MongoDB (which Podio uses on the back end) is "eventually consistent".
I faced this same problem, and ended up making a queueing system for my incoming webhooks, where I waited 30 seconds before actioning any record retrieval from Podio, to get updated values for our local reporting database to cache.
Also related, and more to do with MongoDB's asynchronicity: if you are using Globiflow to trigger updates in related tables using the JavaScript-calculated field from the parent table, I found there were occasionally incorrect values.
I solved it by adding a 30-second delay in the Globiflow script before updating the related app/table with calculated fields from the parent app/table. This gave enough time for the JavaScript to calculate and for MongoDB to save the calculated value.
https://www.globiflow.com/help/wait-delay.php
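For illustration, here is a minimal sketch of the "wait before fetching" approach described above; the 30-second delay matches the answer, while FetchItemFromPodio and UpdateLocalReportingDb are hypothetical callbacks, not Podio API calls:

```csharp
// Sketch of delaying webhook processing so Podio's asynchronous calculations
// (and MongoDB's eventual consistency) have time to settle before we read the item.
// The two Func<> callbacks are placeholders supplied by the caller.
using System;
using System.Threading.Tasks;

public class DelayedWebhookProcessor
{
    private static readonly TimeSpan SettleDelay = TimeSpan.FromSeconds(30);

    private readonly Func<int, Task<object>> _fetchItemFromPodio;
    private readonly Func<object, Task> _updateLocalReportingDb;

    public DelayedWebhookProcessor(
        Func<int, Task<object>> fetchItemFromPodio,
        Func<object, Task> updateLocalReportingDb)
    {
        _fetchItemFromPodio = fetchItemFromPodio;
        _updateLocalReportingDb = updateLocalReportingDb;
    }

    // Called when the item.create webhook arrives.
    public async Task HandleWebhookAsync(int itemId)
    {
        await Task.Delay(SettleDelay);                   // give the calculated fields time to populate
        var item = await _fetchItemFromPodio(itemId);    // fetch the item with (hopefully) final values
        await _updateLocalReportingDb(item);             // cache the values in the local reporting DB
    }
}
```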
I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
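For a rough idea of how that field would be wired up from the discovery-generated .NET client (Google.Apis.Bigquery.v2), here is a sketch; the TemplateSuffix property, the suffix value, and the project/dataset/table names are assumptions based on the discovery document rather than documented behavior:

```csharp
// Sketch only: assumes the generated client exposes TemplateSuffix on
// TableDataInsertAllRequest, mirroring the experimental streaming API field.
// All identifiers ("my-project", "my_dataset", "events", "_20150601") are placeholders.
using System.Collections.Generic;
using Google.Apis.Bigquery.v2;
using Google.Apis.Bigquery.v2.Data;

public static class TemplatedStreamingExample
{
    public static void InsertRow(BigqueryService service, IDictionary<string, object> row)
    {
        var request = new TableDataInsertAllRequest
        {
            // Rows should land in a table named <baseTable><suffix>, created on demand
            // from the base table's schema, per the field's description.
            TemplateSuffix = "_20150601",
            Rows = new List<TableDataInsertAllRequest.RowsData>
            {
                new TableDataInsertAllRequest.RowsData { Json = row }
            }
        };

        service.Tabledata.InsertAll(request, "my-project", "my_dataset", "events").Execute();
    }
}
```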
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields could possibly have bugs etc. Things I could think to be careful of off the top of my head are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
Most of us are doing the same thing as you described.
But we don't use a cron, because we create tables in advance for one year, or on some projects five years, ahead. You may wonder why and when we do so.
We do it whenever the schema is changed by us, the developers. We do a deploy and run a script that takes care of the schema changes for old/existing tables; the script also deletes all of those still-empty future tables and simply recreates them. We didn't complicate our lives with a cron, because we know the exact moment the schema changes (the deploy), and there is no disadvantage to creating tables that far in advance. On our SaaS-based system we also do this per tenant, when a user is created or closes their account.
This way we don't need a cron; we just need to know that the deploy has to do this additional step whenever the schema changes.
As for not losing streaming inserts while you do maintenance on your tables: you need to address that in your business logic, at the application level. You probably have some sort of message queue, like Beanstalkd, to queue all the rows into a tube, with a worker that later pushes them to BigQuery. You may already have this to cover the case where the BigQuery API responds with an error and you need to retry; it's easy to do with a simple message queue. You would rely on that same retry phase when you stop or rename a table for a while: the streaming insert will fail, most probably because the table is not ready for streaming inserts, e.g. it has been temporarily renamed to do some ETL work.
If you don't have this retry phase, you should consider adding it, as it not only helps with retrying failed BigQuery calls but also gives you a maintenance window.
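A minimal sketch of that retry loop, with an in-memory queue standing in for Beanstalkd and a hypothetical tryStreamToBigQuery callback that reports whether the insert succeeded:

```csharp
// Sketch of the retry phase: rows wait in a queue and are re-enqueued on failure,
// which also covers windows where a table is renamed or otherwise unavailable.
// The in-memory queue and the callback are stand-ins for Beanstalkd and the real
// BigQuery streaming call.
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class StreamingInsertWorker
{
    private readonly ConcurrentQueue<object> _pending = new ConcurrentQueue<object>();
    private readonly Func<object, Task<bool>> _tryStreamToBigQuery;   // false means "failed, retry later"

    public StreamingInsertWorker(Func<object, Task<bool>> tryStreamToBigQuery)
    {
        _tryStreamToBigQuery = tryStreamToBigQuery;
    }

    public void Enqueue(object row)
    {
        _pending.Enqueue(row);
    }

    public async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            object row;
            if (_pending.TryDequeue(out row))
            {
                if (!await _tryStreamToBigQuery(row))
                {
                    _pending.Enqueue(row);                          // table busy or renamed: try again later
                    await Task.Delay(TimeSpan.FromSeconds(30), token);
                }
            }
            else
            {
                await Task.Delay(TimeSpan.FromSeconds(1), token);   // nothing queued; back off briefly
            }
        }
    }
}
```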
You've already solved it by partitioning. If table creation is an issue, have an hourly cron in App Engine that verifies that today's and tomorrow's tables are always created.
Very likely App Engine won't go over the free quotas, and it has a 99.95% uptime SLO, so the cron will never go down.
I am using C#.NET; the DB is MS SQL 2008 R2.
I have a question that seems to have been asked a lot in the forums here. I want to use a database table as a queue, but the processing of these messages cannot be done from the database.
I have a table that stores the requests I get from a .NET component. I now have to read the data from this table and make HTTP calls to two web services. Based on the responses received from the web services, the data gets archived or deleted.
I had a few specific questions:
1. How do I make sure that if I pick a record for processing and the HTTP call fails, I can move on to the next record and then come back to the failed record at the end of the run?
2. Is there an alternative to using the database as a queue (like MSMQ etc.), and which option is better?
3. I want to maintain an audit trail of the record status. Is creating a trigger that logs the changes before the edit the best way to do it?
Regards
Leo
Use Service Broker!
I have been using it for a while and think it's a great thing, although it takes time to understand how it works. I used a book to learn it.
Service Broker solves:
concurrency
application state
... many, many other things
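To give a feel for the application side, here is a minimal sketch of receiving from a Service Broker queue with plain ADO.NET; the queue name, connection string, and message handling are placeholders, and the broker objects (message types, contract, queue, service) must already exist in the database:

```csharp
// Sketch of a consumer that blocks on a Service Broker queue via WAITFOR (RECEIVE).
// "dbo.TargetQueue" is a placeholder queue name.
using System;
using System.Data.SqlClient;

public static class ServiceBrokerConsumer
{
    public static void ReceiveOne(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            @"WAITFOR (
                  RECEIVE TOP (1) conversation_handle, message_type_name, message_body
                  FROM dbo.TargetQueue
              ), TIMEOUT 5000;", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                if (reader.Read())
                {
                    var messageType = reader.GetString(1);
                    var bodyLength = reader.IsDBNull(2) ? 0 : reader.GetSqlBytes(2).Length;
                    Console.WriteLine("Received {0} ({1} bytes)", messageType, bodyLength);
                    // Process the message here, and END CONVERSATION when the dialog is done.
                }
            }
        }
    }
}
```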
I have a situation where I need NHibernate to ignore its caches and just hit the database, because the data has changed (another user on another computer has changed it). How can I do this? So far I have had no luck: Get, Load, LINQ query, it doesn't matter; NHibernate is not returning the most recent data.
For the second-level cache, you must clear it using ISessionFactory.Evict(typeof(T)); for the first-level cache, you can simply call ISession.Clear().
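In code that might look like the sketch below; Employee, GetFresh, and employeeId are placeholders for your own types and IDs:

```csharp
// Minimal sketch of clearing both cache levels before re-reading an entity.
using NHibernate;

public class Employee
{
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
}

public static class CacheRefresh
{
    public static Employee GetFresh(ISessionFactory sessionFactory, ISession session, int employeeId)
    {
        sessionFactory.Evict(typeof(Employee));   // second-level cache: drop cached Employee instances
        sessionFactory.EvictQueries();            // also drop the query cache, if it is enabled
        session.Clear();                          // first-level cache: detach everything in this session

        return session.Get<Employee>(employeeId); // this read now goes to the database
    }
}
```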
And if you don't know when to clear the second-level cache, you should send some info from the other app to this one (via sockets, a web service, ...). If that is not possible, you can create a table in the database that records when the data was last modified and check that record; if it has changed, you clear the caches. Just be sure the record gets updated every time some other app changes the database (you can do that with triggers, or by checking system tables).
If you use triggers, don't forget to ignore the record when the update came from NHibernate itself.
You can do that by keeping a variable that holds the time of your own last update and comparing against it.
I'm working on a project where we are thinking of using SQLCacheDependency with SQL Server 2005/2008 and we are wondering how this will affect the performance of the system.
So we are wondering about the following questions
Can the number of SQLCacheDependency objects (query notifications) have a negative effect on SQL Server performance, i.e. on insert, update, and delete operations on the affected tables?
What effect (performance-wise) would, for example, 50,000 different query notifications on a single table have in SQL Server 2005/2008 on insertions and deletions on that table?
Are there any recommendations on how to use SQLCacheDependencies? Any official do's and don'ts? We have found some information on the internet but haven't found anything on performance implications.
If there is anyone here that has some answers to these questions that would be great.
SQL cache dependency using the polling mechanism should not be a significant load on the SQL Server or the application server.
Let's look at the steps required for SqlCacheDependency to work and analyze them:
1. The database is enabled for SQL cache dependency.
2. A table, say 'Employee', is enabled for SQL cache dependency (this can be any number of tables).
3. Web.config is updated to enable SQL cache dependency.
4. The page where you are using SQL cache dependency is configured.
That's it.
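A condensed sketch of those steps in code (the connection string, the "EmployeeDb" database entry name declared under <sqlCacheDependency> in web.config, and the cache key are placeholders):

```csharp
// Sketch of steps 1, 2 and 4 from the list above. Step 3 is the web.config
// <sqlCacheDependency> entry whose database name ("EmployeeDb") is referenced here.
using System.Web;
using System.Web.Caching;

public static class EmployeeCacheSetup
{
    // Steps 1 and 2: enable the database and the Employee table for change notifications
    // (this creates AspNet_SqlCacheTablesForChangeNotification plus the table trigger).
    public static void EnableNotifications(string connectionString)
    {
        SqlCacheDependencyAdmin.EnableNotifications(connectionString);
        SqlCacheDependencyAdmin.EnableTableForNotifications(connectionString, "Employee");
    }

    // Step 4: cache data with a dependency on the Employee table; the cached entry is
    // evicted when the poll detects a change.
    public static void CacheEmployees(HttpContext context, object employees)
    {
        var dependency = new SqlCacheDependency("EmployeeDb", "Employee");
        context.Cache.Insert("employees", employees, dependency);
    }
}
```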
Internally:
Step 1 creates a table, 'AspNet_SqlCacheTablesForChangeNotification', in the database, which stores the name of each table (here, 'Employee') for which SQL cache dependency is enabled, and adds some stored procedures as well.
Step 2 inserts an 'Employee' entry into the 'AspNet_SqlCacheTablesForChangeNotification' table, and also creates an insert/update/delete trigger on the 'Employee' table.
Step 3 enables the application for SQL cache dependency by providing the connection string and poll time.
Whenever there is a change in the 'Employee' table, the trigger fires and in turn updates the 'AspNet_SqlCacheTablesForChangeNotification' table.
The application then polls the database, say every 5000 ms, and checks for changes to the 'AspNet_SqlCacheTablesForChangeNotification' table. If there are any changes, the respective cache entries are removed from memory.
The great benefit is caching combined with reasonably fresh data (at most the data can be about 5 seconds stale at that poll interval). The polling is handled by a background process and should not be a performance hurdle, because, as you can see from the list above, the tasks are not CPU-demanding.
SQLCacheDependency is implemented as an indexed view, and every time the table is modified this view's index gets changed. So many views (SQLCacheDependency objects) on the same table mean quite a performance hit for modifications; however, if you have one view (SQLCacheDependency object) per table, you should have no problems.
The cache-changed notification is asynchronous and is triggered when the server has resources.
You're right, not much information on this is provided, but there's a phrase related to your question on this page: http://msdn.microsoft.com/en-us/library/ms178604%28VS.80%29.aspx
"The database operations associated with SQL cache dependency are simple and therefore do not incur a heavy processing cost on the server."
Hope this helps you although your question is a little bit old already.
This page appears to have some good info on setup and on which technique to use (granted, I did just skim it).
All I can provide is anecdotal evidence for performance, but we use SqlCacheDependency as a sort of "messaging solution" for a large enterprise application that processes on the order of ten thousand messages per hour.
The basic architecture is that our company uses Perforce for source control, and we have a "subscription service" that receives messages from a trigger web service call that gets invoked on every p4 commit and inserts a record into a SQL database. Our application has the dependency set up to send subscription notifications for every changelist that affects a branch or path that you are monitoring.
The performance is fine. The trigger runs on the order of 200 ms, and we have never had a complaint about the latency of relaying the messages to end users.
As always, your mileage may vary.