Backfilling Azure Storage Last Access time - azure-storage

Is there a common approach to backfill last access time for existing azure storage content after enabling access tracking?
Our goal is to change a blob's tier from hot to cold after 30 days last access using lifecycle management. The problem we're encountering is existing content won't gain last access metadata until it's read. We do not see lifecycle rules based on last access triggering until that metadata exists.
Two solutions I can think of:
Script reads for all of the content missing last access metadata.
Adding a temporary (24 hour) rule that pushes all content modified x+ days ago to cold, including the option to let it move back to hot on access.
Option 1 feels a bit over the top, programmatically triggering reads for millions of files. Option 2 feels wasteful as some content will immediately move back to hot upon access.
Is there a better way?

Related

Azure functions output caching

I am creating Azure functions to return data from a database (Azure AS). I will be returning same data for all the requests, so caching the output seems like a good idea as the data changes only once a day
What are my options here?
Options listed from most simple to most complex:
One option is to use static variables - but since the process can get recycled very quickly (assume every few minutes), that may not help much.
Cache via storage (Blob / Table). Your function can first try to read from the table, if missing, it can then read from the database and save back to the table. You could have a second timer function that deletes old cache entries every N hours.
I'd recommend starting here.
Azure Functions can still run arbitrary code, you could call out to any other caching service (ie, Redis) and use the same patterns that you'd use in ASP.Net.

Shared Data Storage Strategy for 'Live' Dashboards in Excel VBA

I'm doing an UI in excel which the goal is to have "live" information on Orders and Order Status between three users, I'll name them DataUser, DashboardOne, and DashboardTwo for examples sake.
The process is that the DataUser will fill in the Orders data, that data is going to be used to populate information on two dashboards. The dashboards are going to be updated live with changes from the DataUser(Orders Increases/Decreases), and changes on order status between DashboardOne and DashboardTwo. For the live updates I'm thinking on using Application.OnTime event call to refresh the View/Dashboards. The two dashboards will be active about 8 hours a day.
Where I'm struggling in on how/where to store the Data, I've though about a couple of options but I don't know the implications of one over the other, especially considering that I intend that the dashboards will run/refresh every 30 sec. with Application.OnTime which could prove expensive.
The options I thought about where:
A Master Workbook that would create separate Workbooks for DashboardOne and DashboardTwo and act database and main UI for DataUser.
Three separate workbooks that would all refer to the one DataWorkbook or another flat data file (perhaps and XML or JSON).
Using an actual database for the data, although this would bring other implications (don't currently have one).
I'm not considering a shared workbook as I've tried something similar in the past (and this time ^^, early steps) and it went rather poorly, nightmare to sync and poor data integrity.
In short:
Which would be the best Data storage strategy for Excel that wouldn't jeopardise the integrity of the data nor be so expensive as to interfere with the uptime rest of the code? Are there better options that I should be considering?
There are quite a number of alternatives, depending on the time you want to invest and the tools at hand. I'll give you a couple of options here.
But first, the basic assumptions:
The amount of data items that you need to share (being a dashboard) is of few tens (let's say, less than 100),
You have at least basic programming skills,
From your description, you have one client with READ-WRITE capabilities while there are two clients with READ-ONLY capability.
OPTION 1:
You can have the Excel saving the data in CSV format (very small amount of data and hence it would take a small fraction of a second to save it and to read it).
The two clients would then open the file in read-only mode, load the data and update the display. You would need to include exception handling at both types of client:
At the one writing, handle the condition of error when it attempts to write at the same time one of the clients attempts to read,
At the two reading, handle the condition of error when attempting to open the file (for read only) while the other process is writing.
Since the write and read operations are going to take a very, VERY short time (as stated, a small fraction of a second), these conditions will be very rare. Additional, since both dashboard clients would be open the file for read-only, they will not disturb each other if they make their attempt at the same moment.
If you wish to drastically reduce the chances of collision, you may set the timers (of the update process on one hand and of the reading processes on the other) to be a primary number of seconds. For instance, the timer of the updating process would be every 11 seconds while that of the reading process would be every 7 seconds.
OPTION 2:
Establish a TCP/IP channel between the processes, where the main process (meaning the one that would have WRITE privilege) would send a triggering message to the other two requesting to start an update whenever a new version of the data had been saved. Upon reception of the trigger, both READ-ONLY processes would approach the file and fetch the data.
In this case, the chances of collision would become near to null.

Programmatically purge document deletion

I've a database with an agent that periodically delete (via Java agent, "removePermanently" method) all documents in a view and re-create them.
After some month, i've noticed that database size is considerably increased.
Showing database information through this command
sh database <dbpath>
it results that i've a lot of deleted documents (i suppose they are deletion stubs)
Document Type Live Deleted
Documents 1,922 817,378
Compacting database, 80% space was recovered.
Is there a way to programmatically delete stubs definitively to avoid "database explosion"? Or, is there a way to correctly manage this scenario (deletion and creation of documents)?
Don't delete the documents! Re-use them. That's the best answer. Seriously. Take the existing documents, clear the fields and set Form := "Obsolete". Modify the selection formula for all your views by appending & Form != "Obsolete" Create a new hidden view called "Obsolete" with selection formula Form = "Obsolete", and instead of creating new documents, change your code to go to the Obsolete view, grab an available document and set new field values (including changing the Form field). Only create new documents if there are not enough available in the Obsolete view. Any performance that you lose by doing this, which really should be minimal with the number of documents that you seem to have, will be more than offset by what you will gain by avoiding the growth and fragmentation of the NSF file that you are creating by doing all the deletions and creating new documents.
If, however, there's no possible way for you to do that -- maybe some third party tool that is outside of your control is creating the documents -- then it's important to know if the database you are talking about is replicated. If it is replicated, then you must be very careful because purging deletion stubs before all replicas are brought up to date will cause deleted documents to "come back to life" if a replica that has been off-line since before the delete occurs comes back on-line.
If the database is not replicated at all, or is reliably replicated across all replicas quickly, then you can reduce the purge interval. Go to the Replication Settings dialog, find the checkbox labeled "Remove documents not modified in the last __ days". Do not check the box, but enter a small number into the number of days. The purge interval for deletion stubs will be set to 1/3 of this number. So if you set it to 3 the effect will be that stubs are kept for 1 day and then purged, giving you 24 hours to assure that all replicas are up to date. If you need more, set the interval higher, maintaining the 3x multiple as needed. If a server is down for an extended period of time (longer than your purge interval), then adjust your operations procedures so that you will be sure to disable replication of the database to that server before it comes back on line and the replica can be deleted and recreated. Be aware, though, that user replicas pose the same problem, and it's not really possible to control or be aware of user replicas that might go off-line for longer than the purge interval. In any case, remember: do not check the box. To reduce the purge interval for deletion stubs only, just reduce the number.
Apart from this, the only way to programmatically delete deletion stubs requires use of the Notes C API. It's possible to call the required routines from LotusScript, but in my experience once the total number of stubs plus documents gets too high you will likely run into an error and may have to create and deploy a new non-replica copy of the database to get past it. You can find code along with my explanation in the answer to this previous question.
I have to second Richard's recommendation to reuse documents. I recently had a similar project, and started the way you did with deleting everything and importing half a million records every night. Deletion stubs and the growth of the FT index quickly became problems, eating up huge amounts of disk space and slowing performance significantly. I tried to manage the deletion stubs, but I was clearly going against the grain of Domino's architecture.
I read Richard's suggestion here, and adopted that approach. Here's what I did:
1) create 2 views based on form - one for 'active' records, and another for 'inactive' records
2) start the agent by setting autoupdate = false for both views
3) use stampall("form", "inactive") to change all fo the active records to inactive
4) manually refresh the 2 views using notesview.refresh()
5) start importing data. for each record, pull a document out of the pool of inactive records (by walking the 'inactive' view)
6) if if run out of inactive records in the pool, create new ones
7) when import is complete, manually refresh the views again
8) use db.createftindex(0, true) to re-create the FT index
the code is really not that complex, and it runs in about the same amount of time, if not faster, than my original approach.
Thanks Richard!
Also, look at the advanced db properties - several things there that will help optimize the db.
It sounds like you are "refreshing" the contents of the database by periodically deleting all the documents and creating new ones from some other source. Cut that out. If the data are in the Notes database already, leave the document alone. What you're doing is very inefficient.

Avoiding a complete resync of Azure's SQL Data Sync

So as a I understand it, if you have an outstanding sync error for more than 40 days Azure's SQL Data Sync forces you to do a fresh upload of your entire database in order to get the service working again.
I'm wondering if there is a way to avoid this as my internet is quite slow and a complete re-sync would take half a week to complete. Is there a way to trick the system into thinking that the error was not present for 40 days and resume it's differential back up after the error has a been corrected?
Unfortunately this is correct. See the MSDN topics http://msdn.microsoft.com/en-us/library/hh667321.aspx#bkmk_databaseoutofdatestatus (for an out-of-date database) and http://msdn.microsoft.com/en-us/library/hh667321.aspx#bkmk_sgoutofdatestatus (for an out-of-date sync group).
If it is a database that is out-of-date, you're better off deleting the database then creating it anew, but leaving it empty of data, and adding the empty database to your sync group. If you merely remove then re-add the populated database to the sync group, on the first sync Data Sync will treat every row as a conflict that needs to be resolved (even if the data is unchanged in the row) which adds a lot of time to the initial sync (and may add costs too). See the topic http://msdn.microsoft.com/en-us/library/hh667328.aspx#InitialSync.

Distributed datastore

We're trying to add some kind of persistence in our app.
The app generates about 250 entries per second. Each of these entries belong to one of 2M files. For each file, we want to keep the last 10 entries, so we can look them up later.
The way our client application works :
it gets a stream of all the data
it fetches the right file (GET)
it adds the new content
it saves the file back (PUT)
We're looking for an efficient way to store this data that can scale horizontally as the amount of data we're getting is doubling every few weeks.
We initially looked at S3. It works fine, but becomes very expensive very fast (>$1000 monthly just in PUT operations!)
We then gave a shot at Riak. But it seems we can't get more than 60 write/sec on each node, which is very very slow.
Any other solution out there?
There are lots of knobs you can turn in Riak - ask the mailing list if you haven't already and we'll figure out a sane configuration for you. 60 writes/sec is not within the norm.
See: http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
What about Hadoop's HDFS spread over Amazon EC2 instances? I know each instance has a good amount of storage space, and you don't have to pay for put/get, only the inbound transfer.
I would suggest looking at CloudIQ Storage from Appistry. Its a fully distributed file store. Its accessible via a REST-based API, and can run on commodity hardware. You can define the number of copies retained on a file by file basis. It supports an Eventually Consistent model so you can balance file consistency with performance.