Amazon S3 renaming and overwriting files: recommendations and risks

I have a bucket with two kinds of file names:
[Bucket]/[file]
[Bucket]/[folder]/[file]
For example, I could have:
MyBucket/bar
MyBucket/foo/bar
I want to rename all the [Bucket]/[folder]/[file] files to [Bucket]/[file] files (thus overwriting / discarding the original [Bucket]/[file] files).
So, as in the previous example, I want MyBucket/foo/bar to become MyBucket/bar (and overwrite / discard the original MyBucket/bar).
I tried two methods:
Using s3cmd's move command: s3cmd mv s3://MyBucket/foo/bar s3://MyBucket/bar
Using Amazon's SDK for PHP: rename('s3://MyBucket/foo/bar', 's3://MyBucket/bar')
Both methods seem to work, but considering I have to do this as a batch process on thousands of files, my questions are:
Which method is preferred?
Are there other better methods?
Must I delete the old files prior to the move/rename? (it seems to work fine without it, but I might not be aware of risks involved)
Thank you.

Since I asked this question about 5 months ago, I have had some time to gain some insights, so I will answer it myself:
From what I have seen, there is no major difference performance-wise. I can imagine that calling s3cmd from within PHP might be costly, due to invoking an external process for each request; but then again, Amazon's SDK uses cURL to send its requests, so there is not much of a difference.
One difference I did notice is that Amazon's SDK tends to throw cURL exceptions (seemingly randomly, and rarely), whereas s3cmd did not crash at all. My scripts run on tens of thousands of files, so I had to learn the hard way to deal with these cURL exceptions.
My theory is that cURL crashes when there is a communication conflict on the server (for example, when two processes try to use the same resource). I am working on a development server on which sometimes several processes access S3 with cURL simultaneously; these are the only situations in which cURL exhibited this behaviour.
In conclusion:
Using s3cmd might be more stable, but using the SDK allows more versatility and better integration with your PHP code, as long as you remember to handle the rare cases (I'd say 1 in every 1000 requests, when several processes run simultaneously) in which the SDK throws a cURL exception.
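To illustrate that last point, here is a minimal retry wrapper of the kind implied above, sketched in Python for brevity (the original scripts are PHP, but the pattern is the same): retry a failed request a few times with a short back-off instead of letting one transient transport error kill a batch of thousands of files. The wrapped operation name and the attempt counts are placeholders.

import time

def with_retries(operation, attempts=3, delay=1.0):
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:                    # e.g. the SDK's cURL/transport exception
            if attempt == attempts:
                raise                        # give up after the last attempt
            time.sleep(delay * attempt)      # simple linear back-off

# usage: with_retries(lambda: move_one_object(src_key, dst_key))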

Since both methods, s3cmd and the SDK, will eventually issue the same REST call, you can safely choose whichever is best for you.
When you move a file, if the target exists it is always replaced. If you don't want that behavior, you will need to check whether the target file name already exists before deciding whether to perform the move operation.
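For illustration, a sketch of that check using Python and boto3 (an assumption on my part; the question uses s3cmd and the PHP SDK, but they all issue the same REST calls). S3 has no native move, so the "rename" here is a copy followed by a delete of the source, skipped when the target key already exists:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
bucket = 'MyBucket'

def move_unless_target_exists(src_key, dst_key):
    try:
        s3.head_object(Bucket=bucket, Key=dst_key)    # does the target already exist?
        return False                                   # yes: skip, do not overwrite
    except ClientError as e:
        if e.response['Error']['Code'] != '404':
            raise                                      # a real error, not just "missing"
    s3.copy_object(Bucket=bucket, Key=dst_key,
                   CopySource={'Bucket': bucket, 'Key': src_key})
    s3.delete_object(Bucket=bucket, Key=src_key)
    return True

move_unless_target_exists('foo/bar', 'bar')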

Related

A persistent simple data storage for Node.JS app implementation?

I'm planning to launch a simple Node.JS utility and push it to Heroku. A fire-and-forget solution that will probably sleep for 90% of the time. Unfortunately, it seems that I require persistent data storage for my purposes (Heroku apps get rebooted daily and storing everything in RAM is unrealistic), and I don't know which way to look, as:
Most SQL hostings are paid / limited-time free / require constant refreshing (like freemysqlhosting).
Storing stuff in plain .txt format is seemingly hard to implement; besides, git always overwrites the contents of a tracked .txt file, and leaving it untracked disposes of it on Heroku and leads to an ENOENT "No such file" error. Yeah, I tried.
So, the question is - how do I implement a simple and built-in solution for storing data? Are there any relevant typical solutions? It's going to be equivalent to just one SQL table.
As you can see, you can answer this on many levels - maybe suggest a free deploy-and-forget SQL hosting (it obviously has to support external connections), maybe tell me how to keep a file tracked in git without actually replacing all of its content with every commit, maybe suggest some module to install. I hope this is not too broad.

When AEM is configured to use an S3 data store, will it make blue-green deployments faster?

Background
We know it's possible to set up a DevOps pipeline that deploys updates to AEM via a blue/green approach by using crx2oak to migrate the content from the old to the new environment. Why we do it this way is out of scope of this question.
The problem with this approach is that the content copy operation can take a significant amount of time as the content in the JCR grows. Other ideas to mitigate this are appreciated.
We also know that AEM can have an S3 datastore that off-loads the binary content into an S3 bucket, which would not need to be rebuilt during a blue/green deployment, as per:
https://helpx.adobe.com/experience-manager/6-3/sites/deploying/using/storage-elements-in-aem-6.html#OverviewofStorageinAEM6
What is unclear from Adobe's documentation is whether the same S3 bucket can be shared across AEM instances (i.e. blue/green instances). Maybe it's just my google fu that has failed...
Question(s)
When a new AEM instance is configured to use an S3 datastore that already has content in it from the old instance, and crx2oak is used to migrate content, will the new instance be able to access the existing content?
Are there any articles/blogs that describe what the potential time savings of this approach would be?
Yes, I could do an experiment, and may do so in the future to answer my own question. For now I'm looking for information from anyone who has already done this. I'm an engineer, so I will not reinvent the wheel if someone else has done so.
You can certainly share the same S3 bucket between instances - in fact, this is commonly used along with binary-less replication from author->publisher(s) and is a tried and true configuration.
It's even possible to share the same bucket between completely different environments (e.g. DEV/STAGE, or BLUE/GREEN in your case). The main "gotcha" to be aware of is with regard to DataStore Garbage Collection (DSGC), because it's very possible that there will be blobs which are referenced by only some of the instances sharing the bucket, and this needs to be taken into account when purging unused blobs.
This is all part of the design though, and there is a flag designed specifically for this purpose which tells DSGC to only execute the first phase (the "mark" phase) of GC and skip the 2nd "sweep" phase until all instances have marked which blobs they wish to keep/discard. Once all instances have done so, the sweep phase can be run to purge blobs not needed by any instance using the bucket.
For a more detailed explanation see the Oak docs:
https://jackrabbit.apache.org/oak/docs/plugins/blobstore.html#Shared_DataStore_Blob_Garbage_Collection_Since_1.2.0
I find it helps to understand that pretty much all of the datastore implementations are done such that blobs are stored according to their checksum, so the same file uploaded twice will only have one copy stored in the datastore, and there will be two segment store records referencing that same blob. In the same way, multiple AEM instances sharing the same bucket will be able to find a given blob regardless of which instance put it there in the first place.
You can see this in action easily with FileDataStore by finding a blob and sha256'ing it - e.g. (this example is on OS X; the checksum command on Linux/Windows will be slightly different):
$ shasum -a256 crx-quickstart/repository/datastore/0c/9e/40/0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002
0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002 crx-quickstart/repository/datastore/0c/9e/40/0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002
There you can see that a) the filename is the checksum, and b) it's nested using the first 3 pairs of characters from that checksum, so you can locate the file just by knowing the hash; and if you store the same binary, even if the name or JCR metadata is different, the blob referenced will be the same literal file on disk.
From memory, the S3 datastore uses prefixes rather than directory nesting because this performs better, but the principle is the same.
Finally, a couple of things to consider are:
1) S3 storage is relatively cheap (and practically unlimited) so there is an argument to be made that it's not as necessary to perform regular DSGC unless you're really trying to pinch pennies.
2) If you do run DSGC you need to think about how this will work with whatever backup strategy you're using for the AEM instances. For instance, if you roll back a segment store shortly after running DSGC you'll likely have to recover some of those purged blobs. You can use versioning and/or lifecycle rules to help with this, but it can add significant additional complexity and time to your restore process.
If you opt to simply skip DSGC and leave the blobs there indefinitely, it's a good idea to make sure the access key or IAM role AEM is using doesn't have the DeleteObject permission for the bucket, just to be sure a rogue GC process can't delete anything.
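As a sketch of that idea, a deny statement along these lines could be attached to the IAM identity AEM uses (the bucket name here is hypothetical; adjust it to your own setup):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BlockDatastoreDeletes",
      "Effect": "Deny",
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
      "Resource": "arn:aws:s3:::my-aem-datastore-bucket/*"
    }
  ]
}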
Hope this helps.
Edit
In all that I forgot to actually answer your question - yes, it will save some time in cloning in most cases. You'll still need to sync the segment store (obviously) and there are various approaches for this. crx2oak is certainly one - you'll see in the documentation there are specific options for using it with S3, where you supply a configuration file (basically a serialised .config file like you'd use with Felix/OSGi).
You can also use something like rsync to simply copy the TAR files over (with at least the target AEM stopped; Oak is generally atomic, so a hot copy from the source can work in theory, but YMMV).
Finally, you could obviously use Mongo and cluster the segment store that way, but all the usual cost/complexity/performance issues with doing so apply.
Another interesting development on the horizon for blue/green type deployments is the CompositeNodeStore - there is a good talk from the 2017 adaptTo() conference that covers this:
https://adapt.to/2017/en/schedule/zero-downtime-deployments-for-the-sling-based-apps-using-docker.html
An external datastore will help a lot, as usually most of the space is used by binary assets; the pure content typed in by real people takes much less.
On my current project (quite small, but the proportions should be typical):
Repository: 4.8 GB total (4.1 GB Segment Store, 780 MB Index)
File DataStore: 222 GB total
If you want to do it, I have the following remarks:
There are different datastores available. For testing I would start with the File DataStore.
In my view, the S3 DataStore only makes sense if you are hosting on Amazon's AWS anyway. Adobe Managed Services does this, and so S3 makes sense for them - but even there, only if you have more than 500 GB of assets.
If you use the green/blue approach, be careful with DataStore garbage collection (just do it manually). The shared datastore is meant for several publishers that have the same content. As an example, you could have the following situation: your editors delete some assets, you run the DataStore GC, and finally you roll back your environment. That means the assets are still in the content repository, but the binaries have been cleaned out of the DataStore.
In order to use a shared file datastore, you need to do the following:
Unpack the Quickstart: java -jar AEM_6.3_Quickstart.jar -unpack
Create a directory for the file datastore (anywhere outside of the crx-quickstart folder)
Create a directory called install inside the extracted crx-quickstart folder
Create a file called org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.cfg inside this install folder
This file contains just one line, path=<path to file datastore> (see https://jackrabbit.apache.org/oak/docs/osgi_config.html); a concrete example follows this list
Place a reference.key file inside the datastore directory. The first time it will be created automatically, but if you always use the same key, the same hash values are used across the datastores in all your environments. This is also a prerequisite for a feature called "binary-less replication" (so a binary would only be replicated the first time between author and publisher)
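For example, assuming the datastore directory was created at /opt/aem/datastore (a hypothetical path), the FileDataStore.cfg would contain just:

path=/opt/aem/datastore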
kind regards,
Alex

Schema migration in Redis

I have an application using redis. I used a key name user:<id> to store user info. Then locally I changed my app code to use key name user:<id>:data for that purpose.
I am scared by the fact that if I git push this new code to my production server, things will break. The reason is that my production Redis server would already have the keys under the older key names.
So the only way I think is to stop my app, change all the older key names to new ones & then restart it. Do you have a better alternative? Thanks for help :)
Pushing new code to your production environment is always a scary business (that's why only the toughest survive in this profession ;)). I strongly recommend that before you change your production code and database, you test the workflow and its results locally.
Almost any update to the application requires stopping it - even if only to replace the relevant files. This is even truer for any changes that involve a database, exactly because of the reason you mentioned.
Even if you can deploy your code changes without stopping the application per se (e.g. a PHP page), you will still want the database change to be done "atomically" - i.e. without any application requests intervening and possibly breaking. While some databases can be taken offline for maintenance, even then you usually stop the app, or else errors will be generated all over the place.
If that is indeed the case, you'll be stopping the app (or putting it into maintenance mode) regardless of the database change, so we take your question to actually mean: what's the fastest way to rename all/some keys in my database?
To answer that question, similarly to the pseudo-code suggested in the other answer, I suggest you use a Lua script such as the following and EVAL it once you stop the app:
for _, k in ipairs(redis.call('keys', 'user:*')) do
  if k:sub(-5) ~= ':data' then
    redis.call('rename', k, k .. ':data')
  end
end
A few notes about this script that you should keep in mind:
Although the KEYS command is not safe to use in production, since you are doing maintenance it can be used safely here. For all other use cases where you need to scan your keys, Redis' SCAN is much more advisable.
Since Lua scripts are "atomic", you can in theory run this script without stopping the app - as long as the script runs (which depends on the size of your dataset) the app's requests will be blocked. Put differently, this approach solves the concern of getting mixed key names (old & new). This, however, is probably not what you'd want to do in any case because a) your app may still error/timeout during that time but mainly because b) it will need to be able to handle both types of key names (i.e. running with old keys -> short/long pause -> running with new keys) making your code much more complex.
The if condition is not required if you're going to run the script only once and succeed.
Depending on the actual contents of your database, you may want to further filter out keys that should not be renamed.
To ensure compatibility, refrain from hardcoding / computationally generating key names - instead, they should be passed as arguments to the script.
You can run a migration script in your redis client language, using RENAME.
If you don't have any other record of the keys, first issue KEYS user:* to list them all, then extract the numeric id from each key, and rename it.
You can issue all of this in a transaction.
So a little pseudocode and redis commands:
MULTI
KEYS user:*
For each key {
id = <Get id from key>
RENAME user:id user:id:data
}
EXEC
Got it?
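For illustration only, the same migration sketched with the redis-py client (an assumption; any client would do). It uses SCAN rather than KEYS, per the note in the other answer, and queues the renames in a pipeline:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)   # hypothetical connection details

pipe = r.pipeline(transaction=True)
for key in r.scan_iter(match='user:*'):
    if not key.endswith(b':data'):        # skip keys that were already migrated
        pipe.rename(key, key + b':data')
pipe.execute()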

OSX: Hook file read event

I have a particular file I want to monitor for file read attempts by all applications on OSX. I'd like to be able to interrupt the requests so I could decide which applications have permission to read the file and which don't (by querying the user, or checking a cache of user responses). Is this possible with the OSX API? If not, is it even possible to get a list of which applications or processes do read a file?
I'm not saying there's no way to do this, but what @Jonathan is talking about isn't it.
That API is for tracking the creation, change, and destruction of files. Notably this tool is used by things like Spotlight to watch activity on the filesystem for new, interesting files.
But, wisely, reading isn't one of the events it tracks.
And even if reading WAS tracked, it is still the wrong mechanism, as it's a notification system after the fact, not in line with the call itself.
I seriously doubt what you want is possible the way you describe it.
With Access Control Lists, you can limit access at the user level (Fred can read the file, but Bob can not). This is a setting on the file itself. But there's no mechanism to allow Bob's App1 to read a file while Bob's App2 can not, since there's really no formal mechanism of "application identity" beyond the command being executed, or whatever the program "says" its name is (both of which can be spoofed if motivated enough).
However, feel free to crawl the Darwin sources -- no doubt the answer is buried in there somewhere near the open(2) call.
EDIT, regarding comment.
What are you trying to do? What's the overall context?
Another thing that you may want to try is to use FUSE.
FUSE is a utility that lets you have "user space filesystems". People use FUSE for many purposes, like reading NTFS volumes, or mounting remote systems via SSH.
They have a simple example, that gives you a skeleton that you can fill in for your purposes.
For most of the use cases, you'll simply defer to the system. However, for OPEN you will add your logic. Then you could point your FUSE utility at a directory and "mount it". Then all of the files below that directory can use your new behavior.
I'm still not sure how you will identify apps by name, but if it's not a real "security" issue, just for local control, I imagine you can come up with something. Activity Monitor has app names, so they must be available, and FUSE will be running within the process space (I think), rather than through some external mechanism.
All that said, I think FUSE is your best bet, but it's probably not appropriate if you want to do this to "any file" with no preparation by the user (like not installing FUSE). If you wanted to do "any file", your FUSE system would need to be mounted at root, and then you'd simply have a full "clone" of the filesystem, with the files from the normal root "unprotected", while those from your new FUSE root would be protected. So, if someone wanted to NOT use your FUSE system, the real file is readily available to them through the actual file location.
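To make the idea concrete, here is a rough sketch of such a pass-through filesystem using the third-party Python fusepy package (an assumption; the answer above doesn't name a specific binding). It mirrors a real directory and gives you a single place, open(), to veto read attempts; the policy check itself is left as a placeholder:

import errno
import os

from fuse import FUSE, FuseOSError, Operations   # third-party "fusepy" package


class GatekeeperFS(Operations):
    def __init__(self, root):
        self.root = root                          # the real directory we mirror

    def _real(self, path):
        return os.path.join(self.root, path.lstrip('/'))

    def _allowed(self, path):
        # Placeholder policy: prompt the user, consult a cache of answers, etc.
        return True

    # Defer to the underlying filesystem for metadata and directory listings.
    def getattr(self, path, fh=None):
        st = os.lstat(self._real(path))
        return {key: getattr(st, key) for key in (
            'st_mode', 'st_nlink', 'st_size', 'st_uid', 'st_gid',
            'st_atime', 'st_mtime', 'st_ctime')}

    def readdir(self, path, fh):
        return ['.', '..'] + os.listdir(self._real(path))

    # The interesting part: every read attempt has to pass through open().
    def open(self, path, flags):
        if not self._allowed(path):
            raise FuseOSError(errno.EACCES)       # deny the read attempt
        return os.open(self._real(path), flags)

    def read(self, path, size, offset, fh):
        os.lseek(fh, offset, os.SEEK_SET)
        return os.read(fh, size)

    def release(self, path, fh):
        return os.close(fh)


if __name__ == '__main__':
    # Mirror ~/protected at ~/protected-view; applications read via the mount.
    FUSE(GatekeeperFS(os.path.expanduser('~/protected')),
         os.path.expanduser('~/protected-view'), foreground=True)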
If not, is it even possible to get a list of which applications or processes do read a file?
The command-line tool fs_usage allows you to monitor filesystem activity, including reading.
The writings of Amit Singh should come in very handy. He explored the API that provides FileSystem events a few years ago, and provided a sample tool that allows you to intercept FS events. It's open source!
If I remember his conclusion properly, there isn't an official API, but you can use Apple's tools to achieve what you want.

How to reliably handle files uploaded periodically by an external agent?

It's a very common scenario: some process wants to drop a file on a server every 30 minutes or so. Simple, right? Well, I can think of a bunch of ways this could go wrong.
For instance, processing a file may take more or less than 30 minutes, so it's possible for a new file to arrive before I'm done with the previous one. I don't want the source system to overwrite a file that I'm still processing.
On the other hand, the files are large, so it takes a few minutes to finish uploading them. I don't want to start processing a partial file. The files are just transferred with FTP or sftp (my preference), so OS-level locking isn't an option.
Finally, I do need to keep the files around for a while, in case I need to manually inspect one of them (for debugging) or reprocess one.
I've seen a lot of ad-hoc approaches to shuffling upload files around, swapping filenames, using datestamps, touching "indicator" files to assist in synchronization, and so on. What I haven't seen yet is a comprehensive "algorithm" for processing files that addresses concurrency, consistency, and completeness.
So, I'd like to tap into the wisdom of crowds here. Has anyone seen a really bulletproof way to juggle batch data files so they're never processed too early, never overwritten before done, and safely kept after processing?
The key is to do the initial juggling at the sending end. All the sender needs to do is:
Store the file with a unique filename.
As soon as the file has been sent, move it to a subdirectory called e.g. completed.
Assuming there is only a single receiver process, all the receiver needs to do is:
Periodically scan the completed directory for any files.
As soon as a file appears in completed, move it to a subdirectory called e.g. processed, and start working on it from there.
Optionally delete it when finished.
On any sane filesystem, file moves are atomic provided they occur within the same filesystem/volume. So there are no race conditions.
Multiple Receivers
If processing could take longer than the period between files being delivered, you'll build up a backlog unless you have multiple receiver processes. So, how to handle the multiple-receiver case?
Simple: each receiver process operates exactly as before. The key is that we attempt to move a file to processed before working on it: that, and the fact that same-filesystem file moves are atomic, means that even if multiple receivers see the same file in completed and try to move it, only one will succeed. All you need to do is make sure you check the return value of rename(), or whatever OS call you use to perform the move, and only proceed with processing if it succeeded. If the move failed, some other receiver got there first, so just go back and scan the completed directory again.
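As a rough sketch of the receiver side (Python, with hypothetical directory paths), the claim-by-rename step looks like this; the atomic same-filesystem move is what guarantees only one receiver wins each file:

import os
import time

COMPLETED = '/data/inbox/completed'     # hypothetical paths
PROCESSED = '/data/inbox/processed'

def process(path):
    pass                                # placeholder for the real work

while True:
    for name in os.listdir(COMPLETED):
        src = os.path.join(COMPLETED, name)
        dst = os.path.join(PROCESSED, name)
        try:
            os.rename(src, dst)         # atomic when both are on the same filesystem
        except FileNotFoundError:
            continue                    # another receiver claimed this file first
        process(dst)                    # the file stays in processed/ afterwards
    time.sleep(30)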
If the OS supports it, use file system hooks to intercept open and close file operations. Something like Dazuko. Other operating systems may let you know about file operations in another way; for example, Novell Open Enterprise Server lets you define epochs and read the list of files modified during an epoch.
Just realized that on Linux, you can use the inotify subsystem, or the utilities from the inotify-tools package.
File transfer is one of the classics of system integration. I'd recommend you get the Enterprise Integration Patterns book to build your own answer to these questions - to some extent, the answer depends on the technologies and platforms you are using for endpoint implementation and for file transfer. It's a quite comprehensive collection of workable patterns, and fairly well written.