This may be an open question, but I could not find any answers to it on the web. I hope it's not a duplicate:
I have to load some data in my app from a JSON file. Every 3 seconds, all previously loaded objects are deleted and then reloaded (reallocated completely). I would like the app to first check whether anything has changed, and only then reload the entire data set, or even better, reload just the new data. How can I check if the file has been modified? Can I, for example, get the date from Dropbox and check it for updated versions of the file?
Ignoring the check for specific data changes: you can store a hash/checksum of the file when you read it, and before reloading compare the file's current checksum/hash with the stored one. No change => no reload.
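A minimal sketch of that idea in C# (the file path, the lastKnownHash field, and the ReloadData routine are placeholders for whatever your app already has):

using System;
using System.IO;
using System.Security.Cryptography;

static string ComputeHash(string path)
{
    using (var sha = SHA256.Create())
    using (var stream = File.OpenRead(path))
        return Convert.ToBase64String(sha.ComputeHash(stream));
}

// Inside the 3-second timer callback:
string newHash = ComputeHash("data.json"); // placeholder path
if (newHash != lastKnownHash)              // hash stored from the previous load
{
    ReloadData();                          // your existing reload routine
    lastKnownHash = newHash;
}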
According to this discussion, you can get the metadata for a file on Dropbox and check the rev field in the metadata. (Stands for revision.) So you can keep track of the rev in your program and only try to reload the file when it changes.
http://forums.dropbox.com/topic.php?id=48787
Beware: there is an older revision field in the metadata, but it is deprecated. (This is also covered in the discussion.)
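If your platform has a Dropbox SDK, the rev check might look roughly like this (a hedged sketch using the official Dropbox .NET SDK; the access token, path, and lastKnownRev variable are placeholders):

using Dropbox.Api;

var client = new DropboxClient(accessToken);                      // your OAuth2 token
var metadata = await client.Files.GetMetadataAsync("/data.json"); // placeholder path
string rev = metadata.AsFile.Rev;                                 // current revision on Dropbox
if (rev != lastKnownRev) // rev stored from the previous load
{
    // The file changed on Dropbox: download and reload it.
    lastKnownRev = rev;
}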
Related
I wrote a Windows service that compares large uncompressed images to their much smaller thumbnails to determine which images need new thumbnails or don't have thumbnails at all. I am using the Created Date of the files to determine which ones require updating (if an uncompressed image has a created date later than its thumbnail's, the thumbnail is out of date).
Everything is working great; my only issue is when I save new versions of the thumbnails over their existing ones. At first I was doing a simple Bitmap.Save, but when overwriting, this would only change the Modified Date of the file. I added a File.Delete() prior to saving the new version, and it deletes the old version and saves a new one (as it should), but the Created Date of the new file is STILL THE OLD CREATED DATE...
I deleted every old thumbnail, waited a few minutes, and then ran the creation code again: brand new created dates. Is there some timeframe during which Windows keeps the file metadata in memory, recognizing identical file names and giving the new files the old Created Date?
According to the documentation for the File.SetCreationTime(String, DateTime) Method,
NTFS-formatted drives may cache file meta-info, such as file creation time, for a short period of time. As a result, it may be necessary to explicitly set the creation time of a file if you are overwriting or replacing an existing file.
However, if you want to be cautious, rename the original file, say by appending ".old". That way, Windows has to create a new directory entry for the new file. And if something goes horribly wrong, there is still the .old copy (until you delete it).
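Either way, the direct fix implied by the documentation is to stamp the creation time explicitly after saving (this creation-time reuse is often called NTFS file system tunneling). A minimal sketch; the paths and thumbnail size are placeholders:

using System;
using System.Drawing;
using System.IO;

string sourcePath = @"C:\images\image1.jpg"; // placeholder paths
string thumbPath  = @"C:\thumbs\image1.jpg";

File.Delete(thumbPath); // remove the old thumbnail first

using (var source = new Bitmap(sourcePath))
using (var thumb = new Bitmap(source, new Size(128, 128))) // scaled copy
{
    thumb.Save(thumbPath);
}

// The new directory entry may have inherited the old creation time,
// so stamp it explicitly:
File.SetCreationTime(thumbPath, DateTime.Now);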
I have two client programs which are using S3 to communicate some information. That information is a list of files.
Let's call the clients the "uploader" and "downloader":
The uploader does something like this:
upload file A
upload file B
upload file C
upload a SUCCESS marker file
The downloader does something like this:
check for SUCCESS marker
if found, download A, B, C.
else, get data from somewhere else
and both of these programs are run periodically. The uploader will populate a new directory when it is done, and the downloader will try to get the latest versions of A, B, C available.
Hopefully the intent is clear — I don't want the downloader to see a partial view, but rather get all of A,B,C or skip that directory.
However, I don't think that works, as written. Thanks to eventual consistency, the uploader's PUTs could be reordered into:
upload file B
upload a SUCCESS marker file
upload file A
...
And at this moment, the downloader might run, see the SUCCESS marker, and assume the directory is populated (which it is not).
So what's the right approach, here?
One idea is for the uploader to first upload A,B,C, then repeatedly check that the files are stored, and only after it sees all of them, then finally write the SUCCESS marker.
Would that work?
I stumbled upon a similar issue in my project.
If the intention is to guarantee cross-file consistency (between files A, B, C), the only possible solution (purely within S3) is:
1) put them as NEW objects (under new keys each time), and
2) do not explicitly check for their existence with a HEAD or GET request prior to the PUT.
These two constraints are required for S3's fully consistent read-after-write behavior on new objects (https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/).
Each time you update the files, you need to generate a unique prefix ("folder") name and put this name into your marker file (the manifest), which you are going to UPDATE in place.
The manifest has a stable name but is only eventually consistent: some clients may get the old version and some may get the new one.
The old manifest points to the old "folder" and the new one points to the new "folder". Thus each client will read only old files or only new files, but never a mix, so cross-file consistency is achieved. Different clients may still end up with different versions, but if the clients keep polling the manifest and reacting to changes, they will eventually become consistent too.
A possible solution for the client inconsistency is to move the manifest metadata out of S3 into a strongly consistent database (such as DynamoDB).
A few obvious caveats with the pure S3 approach:
1) it requires the full set of files to be uploaded each time (incremental updates are not possible)
2) it needs eventual cleanup of old, obsolete folders
3) clients need to keep polling the manifest to get updates
4) clients may be inconsistent with each other
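A hedged sketch of the uploader side of this pattern, using the AWS SDK for .NET (the bucket name, key layout, and file list are assumptions):

using System;
using Amazon.S3;
using Amazon.S3.Model;

var s3 = new AmazonS3Client();
const string bucket = "my-bucket"; // placeholder bucket

// 1) Generate a unique prefix so A, B, C are always NEW objects.
string prefix = $"data/{Guid.NewGuid():N}/";

foreach (var name in new[] { "A", "B", "C" }) // placeholder local files
{
    await s3.PutObjectAsync(new PutObjectRequest
    {
        BucketName = bucket,
        Key = prefix + name,
        FilePath = name // local file to upload
    });
}

// 2) Only then update the manifest (stable key, eventually consistent)
//    to point at the new prefix. A reader that follows the manifest sees
//    either the complete old set or the complete new set, never a mix.
await s3.PutObjectAsync(new PutObjectRequest
{
    BucketName = bucket,
    Key = "data/manifest.txt",
    ContentBody = prefix
});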
It is possible to do this with single copies of the files in S3. Each file (A, B, C) has a unique hash or version code prepended to it [e.g. an md5sum generated from the concatenation of all three files].
In addition, the hash value is uploaded to the bucket as a separate object.
When consuming the files, first read the hash object and compare it to the last hash successfully consumed. If it changed, read the files and check the hash value within each. If they all match, the data is valid and may be used. If not, the downloaded files should be discarded and downloaded again (after a suitable delay).
This will catch the occasional race condition between write and read across multiple objects.
This works because the hash is repeated in all objects. The hash file itself is actually optional, serving as a low-cost, fast shortcut for determining whether the data has been updated.
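A sketch of the consumer-side validation (DownloadText, Consume, and lastConsumedHash are hypothetical stand-ins for your own download and processing code):

string currentHash = DownloadText("hash"); // the separate hash object
if (currentHash == lastConsumedHash)
    return; // nothing new to do

string a = DownloadText("A"); // each file starts with the hash line
string b = DownloadText("B");
string c = DownloadText("C");

bool consistent = a.StartsWith(currentHash)
               && b.StartsWith(currentHash)
               && c.StartsWith(currentHash);

if (consistent)
{
    Consume(a, b, c); // strip the hash line, then use the data
    lastConsumedHash = currentHash;
}
else
{
    // Caught a write/read race across objects: discard and retry after a delay.
}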
I'm adding some testing to my current project which uses Azure blob storage to store telemetry data coming from a stream analytics job. I want to do testing of the routines that get the telemetry data, so I created a separate container for test data. I downloaded a sample set of data, modified the data to serve my needs and re-uploaded (using Azure storage explorer) everything back into the new container.
The tests were immediately failing, and I quickly found out that this is because the LastModified date of the files changed to the date/time of upload. This is fine in itself, but the sequence of the upload was also different. My code uses the modified date of the file to find out which one is the most recent, and it would now return a different file based on the new dates.
I found that you cannot modify this property directly, although changing another property causes it to update. So I know a possible solution: I could write a quick script which gets the sequence of files from my production instance and then touches every file in the test instance in the same sequence.
But... I was wondering whether this is the best option. I also read that it's 'best practice' to store a custom datetime in a separate property, but I don't think I can do that straight from Stream Analytics (which is writing the blobs). I also considered using an Azure Function to do this (new blob => update property), but then I'm adding complexity and something that might fail for whatever reason.
So I'm looking for the best way to solve this problem. Anyone?
Update: this one probably deserves a bit more explanation. Apart from using the LastModified date to sort on, I also use it to filter blobs. The blobs themselves are CSV files containing ASA output data, i.e. telemetry records. Each record has a timestamp, but that information is IN the file. When retrieving data, I don't want to have to dive into each file to find out what the timestamps of its records are. So I use a prefilter to select only the blobs within a certain timespan, and then download/open just those files to get at the records inside.
This works perfectly as long as you do not touch any of the blobs, but obviously it stops working as soon as any of them gets modified for whatever reason. So I'm now convinced that I need a different/better way to solve this issue; but how?
It seems to me that you have two separate things: the data that you want to store in blob storage, and metadata about the blob such as the timestamp. I would create a separate (Azure) database for the metadata, or, even simpler, just add metadata to the (block) blob:
blockBlob.Metadata.Add("from", dateTime.ToString());
blockBlob.Metadata.Add("to", dateTime.ToString());
blockBlob.Metadata.Add("order", "1");
For sorting I would just add a simple order property.
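Reading it back is symmetric (same classic SDK assumption):

blockBlob.FetchAttributes(); // populates blockBlob.Metadata from the service
int order = int.Parse(blockBlob.Metadata["order"]);
DateTime from = DateTime.Parse(blockBlob.Metadata["from"]);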
The comment by @Vignesh deserves the credit here, but in order to get this question marked as answered, I'll provide the answer myself.
With ASA, you can set the output to be structured by date/time. That means in this case, data is written to the blob store with a directory structure such as:
2016 / 06 / 27 / 15 / 23 (= 27-06-2016 15:23)
2016 / 06 / 28 / 11 / 02 (= 28-06-2016 11:02)
The ASA output allows you to specify how granular you want the structure to be; in my case I chose to store it by day (so not including a time path). The ASA runtime will now ensure that data from a certain point in time is stored within a blob that resides in the correct path.
I subsequently changed my logic to no longer use the datetime stamp of the individual blob files, but to simply read just the files from the folders that fall within the time range I'm interested in. That assures we only get data that was produced within that range. And if there's more than one file in a folder, I need to load them all, since they were all in the same time range anyway. As long as minutes are enough granularity for you, this works very well, even though it might feel a bit strange to use a folder structure for such a thing.
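Filtering then becomes a simple prefix listing instead of a metadata check. A sketch with the classic WindowsAzure.Storage SDK (the container variable and path layout are assumptions):

using Microsoft.WindowsAzure.Storage.Blob;

// container is a CloudBlobContainer; list only blobs written on 27-06-2016.
string prefix = "2016/06/27/";
foreach (IListBlobItem item in container.ListBlobs(prefix, useFlatBlobListing: true))
{
    if (item is CloudBlockBlob blob)
    {
        // Download and parse the CSV telemetry records in this blob.
    }
}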
Having a separate 'index' for blobs which tracks their datetime would work too, of course, but it adds complexity which in this case I don't really need.
My iOS application has been in review but was rejected with reference to the iOS Data Storage Guidelines. Specifically, it was rejected because my Core Data database (SQLite) was located in the /Documents folder. I was aware that this folder should only be used if the data cannot be recreated by the application. The reason I chose to put the database there anyway is that one of the entities contains an attribute telling whether a given news item has been read. This information cannot be recreated. Is that not enough to justify putting the database in the /Documents folder?
Another thing is that the current version of my application does not yet use this value to visualize whether a news item has been read. So, should I tell the review team about this attribute and argue why I think the database should be placed in the Documents folder, or should I just move it to /Library/Caches/?
The app review team wants you to split your data apart. Store the re-creatable parts in the Cache folder and the stuff that can't be re-created in the Documents folder. It's okay if there's a little bit of stuff in Documents that could theoretically be re-created—nobody will even notice a title or datestamp—but long text documents, video, audio, or images should be kept in the Cache folder if they can be downloaded again later.
There are a couple different ways you could do this:
Store the downloaded content in the Cache folder and only put the content's filename in your Core Data database (or calculate the filename from something else, like the SHA-1 hash of the URL it was downloaded from; a sketch of that derivation appears below). Make sure your code will re-download any content that's not in the cache.
Use two Core Data stores with a single store coordinator. Note that you can't split an entity's attributes across two stores, so you may have to break some of your entities in half. Nor can you create a relationship from an object in one store to an object in another, so you'll have to store the object ID URI instead. See "Cross-Store Relationships" in the "Relationships and Fetched Properties" section of the Core Data Programming Guide for more details.
Whatever you do, keep in mind that iOS may purge your Cache folder at any time. Be prepared for files in your Cache folder to have disappeared; if that happens, you should re-download the files as the user requests them.
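The filename derivation from the first option is language-agnostic; here is a minimal sketch of the idea (shown in C# for brevity, though on iOS you'd use the platform's own crypto APIs):

using System;
using System.Security.Cryptography;
using System.Text;

// Derive a stable cache filename from the download URL.
static string CacheFileName(string url)
{
    using (var sha1 = SHA1.Create())
    {
        byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(url));
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}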
I have a folder full of binary files, and I want to make a change to these files so that their hashes change. I want to do this in a fashion that doesn't permanently corrupt the files: the change should still allow each file to operate normally, or I should be able to undo the change at any point in time.
Does anyone know of a script I could use to do this, or maybe a program that would automate it?
Cheers
UPDATE
It's an edge case that I am trying to deal with. I have a system that only allows me to store a file with a given hash once. Hence I want to change the content hash of the file so that the file can be stored. Note that the system in question is not one I control or can change.
Couldn't I just add a random '1' to the end of the file and then remove it afterwards without breaking anything? I'm just not sure how to script this, as in how to modify the binary data in this way. Note I'm in a Windows environment.
Without knowing the format of the files, we can't tell. It may in fact be impossible - for instance if these binary files are self-signed with some private key. Changing any single bit within the file is likely to render it invalid.
Is your hash calculated purely from the contents, and not any other metadata that you can change (such as filename or modified date)? If so, you're probably out of luck. If the hash is meant to detect when the content changes, but you're trying to change the hash without actually changing the content, you've clearly got a problem...
What is the hash used for? Why do you want to change it? There may be an alternative solution if you could give us more information about the bigger picture.
EDIT: One alternative is to effectively create your own container format - so while a file is stored in your container format, it's not usable in its original form, but it can be extracted easily. Your container could be as simple as "add four bytes at the end as a seed to disturb the hash" - "extracting" the file would just involve copying it and removing the last four bytes. But the important point is that what you end up with isn't an MP3 file or whatever you started with - it's your custom format, simple as it is. You need to package/extract the file any time you interact with the store.
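Since you're on Windows, a minimal C# sketch of such a container (append four random bytes to disturb the hash; trim them again to extract):

using System;
using System.IO;
using System.Security.Cryptography;

// "Package": append four random bytes so the content hash changes.
static void Package(string path)
{
    var seed = new byte[4];
    RandomNumberGenerator.Fill(seed);
    using (var stream = new FileStream(path, FileMode.Append))
        stream.Write(seed, 0, 4);
}

// "Extract": restore the original file by trimming the last four bytes.
static void Extract(string path)
{
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        stream.SetLength(stream.Length - 4);
}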