Azure Data Factory: retain LastModified date when copying a file - azure-data-factory-2

I want to (automatically, but as part of a pipeline) archive some existing files, by moving them to a new folder.
I've written a pipeline to do that, but since it's a "copy and delete original" operation, the new file gets a new timestamp.
Is there any way to retain the original timestamp, either by actually moving the file or by explicitly setting the LastModified date? (There doesn't appear to be a setting on the Copy Data activity to retain the timestamp.)

I don't think this is supported through ADF's web UI. I could be wrong, but I haven't seen a way to do it.
But you could call the REST API for the Blob service and set the last modified date that way. You could get the file's original last modified date using the Get Metadata activity, then copy the file to the new location, and then call the REST API and reset the property.
https://learn.microsoft.com/en-us/rest/api/storageservices/set-blob-properties
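For illustration, here is a rough sketch of that idea using the Python azure-storage-blob SDK instead of raw REST calls (the connection string, container and blob names are placeholders, and this would run outside ADF, e.g. in an Azure Function step of the pipeline). Note that the system Last-Modified property is maintained by the storage service, so this sketch preserves the original value as custom metadata on the copied blob:

# Hypothetical sketch: capture the source blob's Last-Modified and carry it
# over as custom metadata on the copied blob.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder

src = service.get_blob_client(container="source", blob="data.csv")   # placeholders
dst = service.get_blob_client(container="archive", blob="data.csv")  # placeholders

# Read the original system Last-Modified before the copy
original_last_modified = src.get_blob_properties().last_modified

# ...copy the blob here (e.g. via the ADF Copy activity or start_copy_from_url)...

# Preserve the original timestamp as user metadata on the new blob
dst.set_blob_metadata({"originalLastModified": original_last_modified.isoformat()})

Anything downstream that needs the original timestamp can then read the originalLastModified metadata value instead of the system property.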

Related

Amazon S3: How to safely upload multiple files?

I have two client programs which are using S3 to communicate some information. That information is a list of files.
Let's call the clients the "uploader" and "downloader":
The uploader does something like this:
upload file A
upload file B
upload file C
upload a SUCCESS marker file
The downloader does something like this:
check for SUCCESS marker
if found, download A, B, C.
else, get data from somewhere else
and both of these programs are being run periodically. The uploader will populate a new directory when it is done, and the downloader will try to get the latest versions of A,B,C available.
Hopefully the intent is clear — I don't want the downloader to see a partial view, but rather get all of A,B,C or skip that directory.
However, I don't think that works, as written. Thanks to eventual consistency, the uploader's PUTs could be reordered into:
upload file B
upload a SUCCESS marker file
upload file A
...
And at this moment, the downloader might run, see the SUCCESS marker, and assume the directory is populated (which it is not).
So what's the right approach, here?
One idea is for the uploader to first upload A, B, C, then repeatedly check that the files are stored, and only after it sees all of them write the SUCCESS marker.
Would that work?
I stumbled upon a similar issue in my project.
If the intention is to guarantee cross-file consistency (between files A, B, C), the only possible solution (purely within S3) is:
1) put them as NEW objects, and
2) do not explicitly check for existence using a HEAD or GET request prior to the PUT.
These two constraints are required for fully consistent read-after-write behavior (https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/).
Each time you update the files, you need to generate a unique prefix (folder) name and put this name into your marker file (the manifest) which you are going to UPDATE.
The manifest will have a stable name but will be eventually consistent. Some clients may get the old version and some may get the new one.
The old manifest will point to the old “folder” and the new one will point to the new “folder”. Thus each client will read either only old files or only new files, never a mix, so cross-file consistency is achieved. Still, different clients may end up with different versions; if the clients keep pulling the manifest and getting updated on change, they will eventually become consistent too.
A possible solution for this client-to-client inconsistency is to move the manifest metadata out of S3 into a strongly consistent database (such as DynamoDB). A sketch of the prefix-plus-manifest pattern follows the caveats below.
A few obvious caveats with the pure S3 approach:
1) requires full set of files to be uploaded each time (incremental updates are not possible)
2) needs eventual cleanup of old obsolete folders
3) clients need to keep pulling manifest to get updated
4) clients may be inconsistent between each other
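As a rough illustration, assuming boto3 and placeholder bucket and key names, the prefix-plus-manifest pattern could look like this:

# Hypothetical sketch of the "unique prefix + manifest" pattern.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder

def publish(files):
    # Upload a new, immutable set of files under a fresh prefix,
    # then point the (mutable) manifest at that prefix.
    prefix = f"datasets/{uuid.uuid4()}"
    for name, body in files.items():
        # New object keys -> consistent read-after-write for these PUTs
        s3.put_object(Bucket=BUCKET, Key=f"{prefix}/{name}", Body=body)
    # The manifest is overwritten in place, so readers may briefly see the
    # previous version, but never a mixed set of files.
    s3.put_object(Bucket=BUCKET, Key="datasets/manifest.json",
                  Body=json.dumps({"prefix": prefix, "files": list(files)}))

def consume():
    manifest = json.loads(
        s3.get_object(Bucket=BUCKET, Key="datasets/manifest.json")["Body"].read())
    return {name: s3.get_object(Bucket=BUCKET,
                                Key=f'{manifest["prefix"]}/{name}')["Body"].read()
            for name in manifest["files"]}

Because every data object is written under a fresh prefix, readers never see a half-updated set; at worst they read the previous manifest and the previous, still complete, set of files.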
It is possible to do this with single copies in S3. Each file (A, B, C) will have prepended to it a unique hash or version code [e.g. an md5sum generated from the concatenation of all three files].
In addition, the hash value will be uploaded to the bucket as a separate object.
When consuming the files, first read the hash file and compare it to the last hash successfully consumed. If it has changed, read the files and check the hash value within each. If they all match, the data is valid and may be used. If not, the downloaded files should be discarded and downloaded again (after a suitable delay).
This will catch the occasional race condition between write and read across multiple objects.
This works because the hash is repeated in all objects. The hash file is actually optional, serving as a low-cost and fast shortcut for determining whether the data has been updated.
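A minimal sketch of this embedded-hash scheme (again assuming boto3 and placeholder bucket/key names, with the shared hash stored as the first line of every object):

import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"      # placeholder
KEYS = ["A", "B", "C"]    # placeholder object keys
HASH_KEY = "current.md5"  # small object holding the shared version hash

def publish(payloads):
    # payloads: dict mapping each key in KEYS to its file contents (bytes)
    version = hashlib.md5(b"".join(payloads[k] for k in KEYS)).hexdigest()
    for key in KEYS:
        # Prepend the shared hash as the first line of every object
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=version.encode() + b"\n" + payloads[key])
    s3.put_object(Bucket=BUCKET, Key=HASH_KEY, Body=version.encode())

def try_consume(last_seen_hash):
    published = s3.get_object(Bucket=BUCKET, Key=HASH_KEY)["Body"].read().decode()
    if published == last_seen_hash:
        return None  # nothing new
    files = {}
    for key in KEYS:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        embedded, _, payload = body.partition(b"\n")
        if embedded.decode() != published:
            return None  # writer/reader race: discard and retry after a delay
        files[key] = payload
    return files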

Creating a test-data container in Azure blob storage

I'm adding some testing to my current project which uses Azure blob storage to store telemetry data coming from a stream analytics job. I want to do testing of the routines that get the telemetry data, so I created a separate container for test data. I downloaded a sample set of data, modified the data to serve my needs and re-uploaded (using Azure storage explorer) everything back into the new container.
The tests were immediately failing and I quickly found out that this is because the LastModified date of the files changed into the date/time of upload. This is fine, but the sequence of the upload was also different. My code uses the modified date of the file to find out which one is the most recent, which would now return a different file based on the new dates.
I found that you cannot modify this property, although you can change another property to have it update. So I know the solution: I could write a quick script which gets the sequence of files from my production instance and then touches every file in the test instance in the same sequence.
But... I was wondering whether this is the best option. I also read that it's 'best practice' to store a custom datetime in a separate property, but I don't think I can do that straight from Stream Analytics (which is writing the blobs). I also considered using an Azure Function to do this (new blob => update property), but then I'm adding complexity and something that might fail for whatever reason.
So I'm looking for the best way to solve this problem. Anyone?
Update: this one probably deserves a tiny bit more explanation. Apart from using the LastModified date to sort on, I also use it to filter blobs. The blobs themselves are CSV files containing ASA output data, i.e. telemetry records. Each record has a timestamp, but that information is IN the file. When retrieving data, I don't want to have to dive into each file to find out what the timestamp of those records is. So I use a prefilter to filter out the blobs within a certain timespan, and then only download/open those files to get to the records inside.
This works perfectly as long as you do not touch any of the blobs, but obviously it stops working as soon as any of the blobs gets modified for whatever reason. So I'm now convinced that I need a different/better way to solve this issue; but how?
It seems to me that you have two separate things: the data that you want to store in blob storage and metadata about the blob, such as the timestamp. I would create a separate (Azure) database for the metadata or, even simpler, just add metadata to the (block) blob:
// store the relevant timestamps (and a sort key) as custom blob metadata
blockBlob.Metadata.Add("from", dateTime.ToString());
blockBlob.Metadata.Add("to", dateTime.ToString());
blockBlob.Metadata.Add("order", "1");
For sorting I would just add a simple order property.
The comment by #Vignesh deserves the credit here, but in order to get this question marked as answered, I'll provide the answer myself.
With ASA, you can set the output to be structured by date/time. That means in this case, data is written to the blob store with a directory structure such as:
2016 / 06 / 27 / 15 / 23 (= 27-06-2016 15:23)
2016 / 06 / 28 / 11 / 02 (= 28-06-2016 11:02)
The ASA output allows you to specify how granular you want the structure to be; in my case I chose to store it by day (so not including a time path). The ASA runtime will now ensure that data from a certain point in time is stored within a blob that resides in the correct path.
I then changed my logic to no longer use the datetime stamp of the individual blob files, but simply read just the files from the folders that fall within the time range I'm interested in. That ensures we only get data that was produced within that time range. And if there's more than one file in a folder, I need to load them all, since they were in the same time range anyway. As long as minutes are enough granularity for you, this works very well, even though it might feel a bit strange to use a folder structure for such a thing.
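To illustrate, a minimal sketch of this folder-based filtering (assuming the Python azure-storage-blob SDK, a placeholder connection string and container name, and the "by day" yyyy/MM/dd path format shown above):

# Hypothetical sketch: list only the blobs whose ASA date path falls inside
# the requested range, instead of filtering on the system Last-Modified.
from datetime import date, timedelta
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="telemetry")  # placeholders

def blobs_in_range(start: date, end: date):
    day = start
    while day <= end:
        prefix = day.strftime("%Y/%m/%d/")  # matches the "by day" output path
        for blob in container.list_blobs(name_starts_with=prefix):
            yield blob.name
        day += timedelta(days=1)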
Having a separate 'index' for blobs which tracks their datetime would work too, of course, but it adds complexity which in this case I don't really need.

Accurev - How to update the file content in a stream after a promote

We have different streams for different environments. It is a Grails project, so there is a properties file called application.properties which has a property called app.version. I want that to be updated automatically after every promote done on the stream. Each stream will have a different version number. The trigger server_post_promote_trig will be used to handle the post-promote operation, but I am not sure how to access the files in the stream through the script. I tried to give the path as /Folder1/file as reflected in the XML trigger input file, but I cannot update the file, as the trigger Perl file complains it cannot find the file.
Any help is much appreciated.
If I understand your question correctly, you want to increment the version in a file under source control whenever a promotion occurs in the stream. If this is correct, you need to create a workspace off said stream which will edit/keep/promote the new version of this file. I would create a separate script that gets called by the server_post_promote trigger whenever a promotion occurs in this stream. This script would be placed under source control, accessible in the workspace you created above.
In Accurev, files can only be modified via a workspace. As this is the case, it may be better to implement a pre-promote trigger to update the version information in the file when the user performs the workspace-to-stream promote.
This would be similar to the existing Addheader script that can be found in the examples directory on the accurev server.
Also, within the script, you will probably want to build in logic that detects the promotion of the version file itself, to avoid updating the file again.
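As a rough illustration of the file edit itself (the workspace path, folder layout, and three-part version format are assumptions; the keep/promote of the edited file would still be done with the accurev CLI from that workspace, which is not shown here):

# Hypothetical helper the trigger script could call from a workspace rooted
# at WORKSPACE_DIR; it bumps app.version in application.properties.
import re
from pathlib import Path

WORKSPACE_DIR = Path("/build/version-workspace")           # placeholder
PROPS = WORKSPACE_DIR / "Folder1" / "application.properties"

def bump_app_version():
    text = PROPS.read_text()
    match = re.search(r"^app\.version=(\d+)\.(\d+)\.(\d+)$", text, re.MULTILINE)
    if not match:
        raise ValueError("app.version not found in application.properties")
    major, minor, patch = map(int, match.groups())
    new_version = f"{major}.{minor}.{patch + 1}"
    # Rewrite only the version value, leaving the rest of the file untouched
    PROPS.write_text(text[:match.start(1)] + new_version + text[match.end(3):])
    return new_version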

Comparing two JSON files objective c

The question is broad, but I could not find any answers to it on the web; I hope it's not a duplicate:
I have to load some data in my app from a JSON file. Every 3 seconds, all previously loaded objects are deleted and then reloaded (completely re-allocated). I would like the app to first check whether there has been a change, and only reload the entire data set if there has been, or, even better, reload just the new data. How can I check if the file has been modified? Can I, for example, get the date from Dropbox and check it for updated versions of the file?
Ignoring the check for specific data changes, you can store the hash/checksum of the file when you read it, and before reloading compare the new checksum/hash with the stored one. No change => no reload.
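A language-agnostic sketch of that idea (shown here in Python for brevity; the hash algorithm and file path are placeholders, and the same pattern translates directly to Objective-C):

import hashlib

_last_digest = None

def reload_if_changed(path, reload_fn):
    # Only perform the expensive reload when the file's checksum has changed.
    global _last_digest
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest == _last_digest:
        return False  # no change => no reload
    _last_digest = digest
    reload_fn(path)
    return True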
According to this discussion, you can get the metadata for a file on Dropbox and check the rev field in the metadata. (Stands for revision.) So you can keep track of the rev in your program and only try to reload the file when it changes.
http://forums.dropbox.com/topic.php?id=48787
Beware: there is an older metadata field for the revision, but it is deprecated. (Also covered in the discussion.)

Creating multiple branches/streams in RTC source control

What is the standard method of creating multiple streams of development of the same project in RTC source control?
Currently, to create a single stream, I create a repository workspace and its corresponding stream. I check the project in to the workspace and then deliver it to this new stream. To create a new stream of development for the project, do I need to repeat this process, or is there a better way, maybe using the command line?
No, you don't need to repeat the process.
I would recommend putting a baseline on the component you delivered in the first stream, or putting a snapshot on the first stream; that will label all the components in that stream.
Then you create a second stream, which you can:
fill component by component, specifying a baseline for each one,
or initialize directly from a snapshot, which will put all the components with their associated labels in that new stream.
Then you create your repository workspace and start working.
So the idea behind a new stream is to specify from which version you want to start working.
Hence the baselines or snapshot put on the first stream: they help initialize the next stream without having to re-import everything.