Creating a test-data container in Azure blob storage - azure-storage

I'm adding some testing to my current project which uses Azure blob storage to store telemetry data coming from a stream analytics job. I want to do testing of the routines that get the telemetry data, so I created a separate container for test data. I downloaded a sample set of data, modified the data to serve my needs and re-uploaded (using Azure storage explorer) everything back into the new container.
The tests were immediately failing and I quickly found out that this is because the LastModified date of the files changed into the date/time of upload. This is fine, but the sequence of the upload was also different. My code uses the modified date of the file to find out which one is the most recent, which would now return a different file based on the new dates.
I found that you cannot modify this property, although you can change another property to have it update. So I know the solution: I could write a quick script which gets the sequence of files from my production instance and then touches every file in the test instance in the same sequence.
But... I was wondering whether this is the best option. I also read it's 'best practice' to store a custom datetime in a separate property, but I don't think I can do that straight from Stream Analytics (which is writing the blobs). I also considered using an Azure Function to do this (new blob => update property), but I'm than adding complexity and something that might fail for whatever reason.
So I'm looking for the best way to solve this problem. Anyone?
Update: this one probably deserves a tiny bit more explanation. Apart from using the LastModified date to sort on, I also use it to filter blobs. The blobs themselves are CSV files containing ASA output data, so telemetry records. Each record has a timestamp, but that information is IN the file. When retrieving data, I don't want to have to dive into each file to find out what the timestamp is of those records. So I use a prefilter to filter out the blobs within a certain timespan, and then only download / open those file to the records inside.
This works perfectly as long as you do not touch any of the blob, but obviously it stops working as soon as any of the blobs gets modified for whatever reason. So I'm now convinced that I need a different / better way to solve this issue; but how?

It seems to me that you have two separate things: the data that you want to store in blob storage and metadata about the blob such as the timestamp. I would create a different (azure) database for the metadata or even simpler just add metadata to the (block)blob:
blockBlob.Metadata.Add("from", dateTime.ToString());
blockBlob.Metadata.Add("to", dateTime.ToString());
blockBlob.Metadata.Add("order", "1");
For sorting I would just add a simple order property.

The comment by #Vignesh deserves the credit here, but in order to get this one marked answer I'll provide it myself.
With ASA, you can set the output to be structured by date/time. That means in this case, data is written to the blob store with a directory structure such as:
2016 / 06 / 27 / 15 / 23 (= 27-06-2016 15:23)
2016 / 06 / 28 / 11 / 02 (= 28-06-2016 11:02)
The ASA output allow you to specify how granular you want the structure to be, in my case I chose to store it by day (so not including a time path). The ASA runtime will now ensure that data from a certain point in time is stored within a blob in that resides in the correct path.
Then I subsequently changed my logic to not use the datetime stamp of the individual blob files any more, but simply read just the files from the folders that are within the timerange I'm interested in. That assures we only get data that was produced within that timerange. And if there's more than one file in a folder, I need to load them both since both were in the same timerange anyway. As long as minutes are enough granularity for you, this works excellent even though it might feel a bit strange to use a folder structure for such a thing.
Having a seperate 'index' for blobs which tracks their datetime would work too of course, but adds complexity which in this case I don't really need.

Related

Why is matillion not loading data from S3?

I have a simple S3 load with all the correct information. There are no validation errors but and the package executes without a problem. It's just that there is no data in the table. Any tips from someone that is knowledgeable about Matillion?
There are a number of reasons why Matillion might not appear to load any data in an S3 Load.
Firstly, I'd check that the pattern matches the file names in the S3 location, which is a regular expression match.
I believe that also includes the path which you may have included in the location parameter, so it may be worth modifying your pattern to look something like .*\/FilePrefix.* or even just .* and then selecting the actual file in the location parameter
Secondly, if the files were last modified more than 64 days ago, or they have already been loaded in to the table previously, Snowflake won't load them by default, which you can get around by turning the Force Load parameter On.

Amazon S3: How to safely upload multiple files?

I have two client programs which are using S3 to communicate some information. That information is a list of files.
Let's call the clients the "uploader" and "downloader":
The uploader does something like this:
upload file A
upload file B
upload file C
upload a SUCCESS marker file
The downloader does something lie this:
check for SUCCESS marker
if found, download A, B, C.
else, get data from somewhere else
and both of these programs are being run periodically. The uploader will populate a new directory when it is done, and the downloader will try to get the latest versions of A,B,C available.
Hopefully the intent is clear — I don't want the downloader to see a partial view, but rather get all of A,B,C or skip that directory.
However, I don't think that works, as written. Thanks to eventual consistency, the uploader's PUTs could be reordered into:
upload file B
upload a SUCCESS marker file
upload file A
...
And at this moment, the downloader might run, see the SUCCESS marker, and assume the directory is populated (which it is not).
So what's the right approach, here?
One idea is for the uploader to first upload A,B,C, then repeatedly check that the files are stored, and only after it sees all of them, then finally write the SUCCESS marker.
Would that work?
Stumbled upon similar issue in my project.
If the intention is to guarantee cross-file consistency (between files A,B,C) the only possible solution (purely within s3) is:
1) to put them as NEW objects
2) do not explicitly check for existence using HEAD or GET request prior to the put.
These two constraints above are required for fully consistent read-after-write behavior (https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/)
Each time you update the files, you need to generate a unique prefix (folder) name and put this name into your marker file (the manifest) which you are going to UPDATE.
The manifest will have a stable name but will be eventually consistent. Some clients may get the old version and some may get the new one.
The old manifest will point to the old “folder” and the new one will point the new “folder”. Thus each client will read only old files or only new files but never mixed, so cross file consistency will be achieved. Still different clients may end up having different versions. If the clients keep pulling the manifest and getting updated on change, they will eventually become consistent too.
Possible solution for client inconsistency is to move manifest meta data out of s3 into a consistent database (such as dynamo db)
A few obvious caveats with pure s3 approach:
1) requires full set of files to be uploaded each time (incremental updates are not possible)
2) needs eventual cleanup of old obsolete folders
3) clients need to keep pulling manifest to get updated
4) clients may be inconsistent between each other
It is possible to do this single copies in S3. Each file (A B C) will have prepended to it a unique hash or version code [e.g. md5sum generated from the concatenation of all three files.]
In addition the hash value will be uploaded to the bucket as well into a separate object.
When consuming the files, first read the hash file and compare to the last hash successfully consumed. If changed, then read the files and check the hash value within each. If they all match, the data is valid and may be used. If not, the downloaded files should be disgarded and downloaded again (after a suitable delay)..
This will catch the occassional race condition between write and read across multiple objects.
This works because the hash is repeated in all objects. The hash file is actually optional, serving as a low-cost and fast short cut for determining if the data is updated.

Get a list of files on S3 that have changed since X

I have about 40,000 images up on S3 and I've downloaded into my application/database then sent them out to another site (like ebay or magento)
This is to support a client that sells his products on a few sites. Sites which really you'd rather keep a copy of the product image on their site. (so they can resize it and such)
My issue right now is that I want to poke S3 every once and a while looking for new files, or modified files.
I don't much like the idea of targeting each file one at a time. Nor do I like the idea of bringing down all the file names and dates and then comparing them with dates I've stored. Both seem to be quite wasteful especially if I want to run this every day (or every hour).
What I had hoped for, and what I'm looking for is a way to say "give me the names of all the files that have changed since 2013-10-14 13:10:30. This would let me store just one value, and if nothings changed, then I'd get back nothing (or something that indicated nothing).
Is there a way to get a list of changed files since X date?
I'm language agnostic.. though Ruby/Rails would be cool.
Note: I've tried to figure it out with the WSDL, but it doesn't quite seem to help as much as I'd hope.
Unfortunately S3 does not offer any support for this.
Currently your only options are to either list all the objects in the S3 bucket and check for changes or to keep track of the changed objects separate from S3 (record the last changed timestamp in some data store when you change the objects).

Need suggestions on which option will be efficient to store data on iPad

This is my first time that I am working on a big project for a client. So I was not sure how to solve this problem. However I have come up with two different ideas but I need professionals opinion about which one is better :)
Situation :
There is an application which runs on different client's iPad. Application data is stored by using giant XML file. This XML file is shared among all client by a server. So a server has a centralised copy and each client has their own copy. Once client made changes to their XML copy they updates server copy in and other client updates their copy by updated server copy.
Now only one client can make changes at one time, To fix this I have logic by which before client starts editing XML they need to get ownership from server and server will only allow one client to edit at one time.
Visual Representation :
Now on client side I have to think of a logic by which I will update my client copy and upload it to server. There are two options,
Option 1 :
In option 1, I can directly manipulate XML file by using GDataXML parser and upload that copy to server. For persistence I can save client copy on my iPad in document directory.
Option 2 :
In option 2, I can read XML file create a CoreData representation for local storage. When ever I update data inside core data it will I will change XML file too and than upload that file on server. Double work but I guess better persistence.
Now which one more robust and advisable? Personally I was planning to do option 2 because it seems more robust as I am persisting application data in core data. But option 1 seems more easy work but I don't know how good persistency will remain.
Sorry for lengthy question,
Thanks for any input given.
There are a number of factors which would influence selecting the second option over the first.
How big is the XML file? If you need to work with very large documents, you may need to incrementally parse the XML (SAX) into core data. This will allow you to access the document's contents without loading it all into memory at once.
Do you need to run complex queries in the data? If so, you may be better off using core data fetch predicates, rather than xpath or XSL.
Are you already using core data? Depending on how the XML data is structured, it might be simpler overall to import the data into your existing persistent store.
Otherwise, you can probably make due with parsing the entire document and either traversing the resulting tree or querying with xpath.
If you need to create an object graph based on what you get from server and show it to user (which you most probably need to do), you should stick up to second option, since it allows easy and robust data persistence.
If you do not need to present user with any data from the XML file you can, of course, store it in the Documents directory.
So, if this is a client application and it has at least some visual representation of the data from an XML file you should use CoreData.
If you want a regular update of data , then use CoreData

How to maintain lucene indexes in azure cloud-app

I just started playing with the Azure Library for Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory). Until now, I was using my own custom code for writing lucene indexes on the azure blob. So, I was copying the blob to localstorage of the azure web/worker role and reading/writing docs to the index. I was using my custom locking mechanism to make sure we dont have clashes between reads and writes to the blob. I am hoping Azure Library would take care of these issues for me.
However, while trying out the test app, I tweaked the code to use compound-file option, and that created a new file everytime I wrote to the index. Now, my question is, if I have to maintain the index - i.e keep a snapshot of the index file and use it if the main index gets corrupt, then how do I go about doing this. Should I keep a backup of all the .cfs files that are created or handling only the latest one is fine. Are there api calls to clean up the blob to keep the latest file after each write to the index?
Thanks
Kapil
After i answered this, we ended up changing our search infrastructure and used Windows Azure Drive. We had a Worker Role, which would mount a VHD using the Block Storage, and host the Lucene.NET Index on it. The code checked to make sure the VHD was mounted first and that the index directory existed. If the worker role fell over, the VHD would automatically dismount after 60 seconds, and a second worker role could pick it up.
We have since changed our infrastructure again and moved to Amazon with a Solr instance for search, but the VHD option worked well during development. it could have worked well in Test and Production, but Requirements meant we needed to move to EC2.
i am using AzureDirectory for Full Text indexing on Azure, and i am getting some odd results also... but hopefully this answer will be of some use to you...
firstly, the compound-file option: from what i am reading and figuring out, the compound file is a single large file with all the index data inside. the alliterative to this is having lots of smaller files (configured using the SetMaxMergeDocs(int) function of IndexWriter) written to storage. the problem with this is once you get to lots of files (i foolishly set this to about 5000) it takes an age to download the indexes (On the Azure server it takes about a minute,, of my dev box... well its been running for 20 min now and still not finished...).
as for backing up indexes, i have not come up against this yet, but given we have about 5 million records currently, and that will grow, i am wondering about this also. if you are using a single compounded file, maybe downloading the files to a worker role, zipping them and uploading them with todays date would work... if you have a smaller set of documents, you might get away with re-indexing the data if something goes wrong... but again, depends on the number....