Notification about Azure blob object changes

Can I somehow subscribe for notifications about Azure's blob object changes?
My purpose is to delegate file uploads to the client using SAS and later (after the upload is complete) update the database. It looks like I need to continuously check the blob's state, but that is quite a resource-consuming process.

You can't be notified by the Blob Storage about a change made to a blob, but as you point out, you can monitor it, requesting the ETag on a scheduled basis to see if it's done.
That being said, the cost of monitoring a blob (or even a whole container) can be close to negligible if correctly implemented. Polling Blob Storage once per second costs you roughly $2.50 / month. Then, by using some heuristic, you can probably lower this cost to $0.25 (one check per 10 s on average). At that point, it's not really worth optimizing further.
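If you do go the polling route, a minimal sketch (using the azure-storage-blob Python SDK, with a hypothetical container/blob name and connection-string environment variable) could look like this:

```python
import os
import time

from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobClient

# Hypothetical names: point this at the blob the client uploads via SAS.
blob = BlobClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"],
    container_name="uploads",
    blob_name="report.csv",
)

last_etag = None
while True:
    try:
        props = blob.get_blob_properties()
        if props.etag != last_etag:
            last_etag = props.etag
            print(f"Blob changed, ETag is now {last_etag}")  # e.g. update the database here
    except ResourceNotFoundError:
        pass  # the client hasn't finished (or started) the upload yet
    time.sleep(10)  # roughly one check per 10 s, as suggested above
```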

You can now do this using Azure Functions:
Create a blob trigger by specifying your storage account connection string and your container/{name}
In outputs, select the place where you want your notification to go
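For reference, here is a minimal sketch of what such a blob trigger might look like in the Python programming model; the container path "uploads/{name}", the connection setting name, and the logging output are assumptions you would adapt to your own setup:

```python
import logging

import azure.functions as func

app = func.FunctionApp()

# "uploads/{name}" and "AzureWebJobsStorage" are placeholders for your container path
# and storage connection setting; send the notification wherever you need from here.
@app.blob_trigger(arg_name="blob", path="uploads/{name}", connection="AzureWebJobsStorage")
def on_blob_uploaded(blob: func.InputStream):
    logging.info("Blob uploaded: %s (%d bytes)", blob.name, blob.length)
    # e.g. update your database or push a message to a queue output binding here
```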

Another option to consider is to have the client notify you when it's done uploading.

I created a file change monitor for monitoring blobs - full details at http://ben.onfabrik.com/posts/monitoring-files-in-azure-blob-storage

Related

Sending data from website to BigQuery using Pub/Sub and Cloud Functions

Here's what I'm trying to accomplish
A visitor lands on my website
Javascript collects some information and sends a hit
The hit is processed and inserted into BigQuery
And here's how I have planned to solve it
The hit is sent to Cloud Functions HTTP trigger (using Ajax)
Cloud Functions sends a message to Pub/Sub
Pub/Sub sends data to another Cloud Function using a Pub/Sub trigger
The second Cloud Function processes the hit into a BigQuery row and inserts it into BigQuery
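To make the plan concrete, the first function would be roughly something like this (project, topic and field names are just placeholders):

```python
import json
import os

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# "my-project" and "hits" are placeholder names for the project and topic.
topic_path = publisher.topic_path(os.environ.get("GCP_PROJECT", "my-project"), "hits")

def collect_hit(request):
    """HTTP-triggered Cloud Function: forward the hit to Pub/Sub and return immediately."""
    hit = request.get_json(silent=True) or {}
    publisher.publish(topic_path, json.dumps(hit).encode("utf-8"))
    return ("", 204)  # CORS headers omitted for brevity
```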
Is there a simpler way to solve this?
Some other details to take into account
There are around 1 million hits a day
Don't want to use Cloud Dataflow because it inflates the costs
Can't (probably) skip Pub/Sub because some hits are sent when a person is leaving the site and the request might not have enough time to process everything.
You can use BigQuery streaming; this is less expensive and you avoid hitting the load jobs quota of 1,000 per table per day.
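A minimal sketch of the second function using streaming inserts (the table reference and field mapping are assumptions):

```python
import base64
import json

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.analytics.hits"  # placeholder table reference

def process_hit(event, context):
    """Pub/Sub-triggered Cloud Function: stream one hit into BigQuery."""
    hit = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    row = {"page": hit.get("page"), "user_agent": hit.get("ua")}  # map to your schema
    errors = client.insert_rows_json(TABLE_ID, [row])  # streaming insert, no load job
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```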
Another option, if you don't mind the data taking a long time to load, is to store all the info in a Cloud Storage bucket and then load it with a transfer. You can schedule it so that the data is loaded daily. This solution is focused on a batch environment, in which you store all the info on one side and then transfer it to the final destination. If you only want streaming, the solution that you mentioned is fine.
It's up to you to choose the option that best fits your specific usage.
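If you go with the batch option, a daily load from the bucket might look roughly like this (bucket path and destination table are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: adjust the bucket path and destination table to your project.
uri = "gs://my-hit-bucket/hits-*.json"
table_id = "my-project.analytics.hits"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the batch load to finish; schedule this once a day
```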

Azure functions output caching

I am creating Azure Functions to return data from a database (Azure AS). I will be returning the same data for all requests, so caching the output seems like a good idea, as the data changes only once a day.
What are my options here?
Options listed from most simple to most complex:
One option is to use static variables - but since the process can get recycled very quickly (assume every few minutes), that may not help much.
Cache via storage (Blob / Table). Your function can first try to read from the table; if the entry is missing, it can read from the database and save the result back to the table. You could have a second timer-triggered function that deletes old cache entries every N hours. (A sketch of this pattern follows these options.)
I'd recommend starting here.
Azure Functions can still run arbitrary code, so you could call out to any other caching service (e.g., Redis) and use the same patterns that you'd use in ASP.NET.
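A minimal sketch of option 2, assuming a blob-backed cache, the Python programming model, and a placeholder database call:

```python
import json
import os

import azure.functions as func
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobClient

app = func.FunctionApp()

def _cache_blob() -> BlobClient:
    # "cache" container and "daily-data.json" blob are placeholder names.
    return BlobClient.from_connection_string(
        os.environ["AzureWebJobsStorage"],
        container_name="cache",
        blob_name="daily-data.json",
    )

def _query_database() -> dict:
    # Placeholder for the (expensive) Azure Analysis Services query.
    return {"rows": []}

@app.route(route="data")
def get_data(req: func.HttpRequest) -> func.HttpResponse:
    blob = _cache_blob()
    try:
        payload = blob.download_blob().readall()          # cache hit
    except ResourceNotFoundError:
        payload = json.dumps(_query_database()).encode()  # cache miss: query the database
        blob.upload_blob(payload, overwrite=True)         # save for subsequent requests
    return func.HttpResponse(payload, mimetype="application/json")
```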

Reliability of Windows Azure Storage Logging

We are in the process of creating a piece of software to back up a storage account (blobs & tables, no queues), and while researching how to do this we came across the possibility of using storage logging. We would like to use this feature to do smart incremental backups after an initial full backup. However, in the introductory post for this feature here, the following caveat is mentioned:
During normal operation all requests are logged; but it is important to note that logging is provided on a best effort basis. This means we do not guarantee that every message will be logged due to the fact that the log data is buffered in memory at the storage front-ends before being written out, and if a role is restarted then its buffer of logs would be lost.
As this is a backup solution, this behavior makes the feature unusable; we can't miss a file. However, I wonder if this has changed in the meantime, as Microsoft has built a number of features on top of it, like blob function triggers and, very recently, their new Azure Event Grid.
My question is whether this behavior has changed in the meantime, or whether the logs are still on a best-effort basis and we should stick to our 'scanning' strategy.
The behavior for Azure Storage logs is still the same. For your case, you might be better off using the Event Grid notifications for Blob Storage: https://azure.microsoft.com/en-us/blog/introducing-azure-event-grid-an-event-service-for-modern-applications/
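For what it's worth, subscribing a function to those events might look roughly like this in the Python programming model (the backup bookkeeping itself is left as a placeholder):

```python
import logging

import azure.functions as func

app = func.FunctionApp()

@app.event_grid_trigger(arg_name="event")
def on_blob_created(event: func.EventGridEvent):
    # Fires for events such as Microsoft.Storage.BlobCreated once the function
    # is wired to an Event Grid subscription on the storage account.
    data = event.get_json()
    logging.info("Event %s for blob %s", event.event_type, data.get("url"))
    # e.g. record the blob in your incremental-backup manifest here
```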

Should I cache blob content to local HD?

Suppose I have files in blob storage, and these files are constantly used by my web application hosted in Windows Azure.
Should I perform some sort of caching of these blobs, like downloading them to my app's local hard-drive?
Update: I was asked to provide a case to make it clear why I want to cache content, so here it goes: imagine I have an e-commerce website and my product images are all high-resolution. Sometimes, though, I would like to serve them as thumbnails (e.g. for product listings), and one possible solution for that is to use an HTTP handler to resize the images on demand. I know I could use output caching so that the image only needs to be resized once, but for the sake of this example, let's just consider that I would process the image every time it was requested. I imagine it would be faster to have the contents cached locally. In this case, would it be better to cache it on the HD or to use local storage?
Thanks in advance!
Just to start answering your question: yes, accessing static content from role-specific local storage will be faster than accessing it from Azure Blob Storage, due to network latency, even when the compute instance and the blob are in the same data center.
One solution would be to download X blobs from Azure Storage during a startup task (or a background task) into role-specific local storage and reference this static content via local storage. However, the real question is: why do you want to cache the content from Azure Blob Storage? Is it for faster access or for reliability? If the reason is to have static content accessible almost immediately, then caching it in local storage makes sense.
There are pros and cons to each approach; however, if you can be specific about why you want to do this, you may get a much more to-the-point response.
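As an illustration only (using the current azure-storage-blob SDK and hypothetical names), a download-on-first-use local cache could be as simple as:

```python
import os

from azure.storage.blob import BlobClient

# Placeholder: point this at a local-storage / local-resource folder on the instance.
CACHE_DIR = os.environ.get("LOCAL_CACHE_DIR", "/tmp/blob-cache")

def get_cached_blob(container: str, name: str) -> str:
    """Return a local path for the blob, downloading it only on the first request."""
    local_path = os.path.join(CACHE_DIR, container, name)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob = BlobClient.from_connection_string(
            os.environ["STORAGE_CONNECTION_STRING"],
            container_name=container,
            blob_name=name,
        )
        with open(local_path, "wb") as f:
            blob.download_blob().readinto(f)
    return local_path
```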
Why not use a local resource? It gives you a path to a folder on the HD, and you can get a lot of space. You can even keep it around between restarts.
Another option is Azure Cloud Drive. It's fast, and would allow you to share the cache among instances (but only one instance can write to it at a time).
Erick

Distributed datastore

We're trying to add some kind of persistence in our app.
The app generates about 250 entries per second. Each of these entries belongs to one of 2M files. For each file, we want to keep the last 10 entries, so we can look them up later.
The way our client application works :
it gets a stream of all the data
it fetches the right file (GET)
it adds the new content
it saves the file back (PUT)
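In code, the loop is essentially this read-modify-write pattern (shown against S3 with boto3 and placeholder names, just to make the access pattern and the per-update PUT explicit):

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-entries-bucket"  # placeholder bucket name

def append_entry(file_id: str, entry: dict, keep_last: int = 10) -> None:
    """GET the file, append the new entry, keep the last N, PUT it back."""
    key = f"entries/{file_id}.json"
    try:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        entries = json.loads(body)
    except s3.exceptions.NoSuchKey:
        entries = []
    entries = (entries + [entry])[-keep_last:]
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(entries).encode())  # one PUT per update
```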
We're looking for an efficient way to store this data that can scale horizontally as the amount of data we're getting is doubling every few weeks.
We initially looked at S3. It works fine, but becomes very expensive very fast (>$1000 monthly just in PUT operations!)
We then gave Riak a shot, but it seems we can't get more than 60 writes/sec on each node, which is very, very slow.
Any other solution out there?
There are lots of knobs you can turn in Riak - ask the mailing list if you haven't already and we'll figure out a sane configuration for you. 60 writes/sec is not within the norm.
See: http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
What about Hadoop's HDFS spread over Amazon EC2 instances? I know each instance has a good amount of storage space, and you don't have to pay for put/get, only the inbound transfer.
I would suggest looking at CloudIQ Storage from Appistry. It's a fully distributed file store, accessible via a REST-based API, and it can run on commodity hardware. You can define the number of copies retained on a file-by-file basis. It supports an eventually consistent model, so you can balance file consistency with performance.