How can I have Alluxio show all the not-yet-accessed files in the directory? - amazon-s3

When I mounted an S3 bucket under alluxio://s3/, the bucket already contained objects. However, when I list the directory (via alluxio fs ls, by listing the FUSE-mounted directory, or on the web UI), I see no files. When I write a new file or read an already existing object through Alluxio, it then appears in the directory listing. Is there a way to have Alluxio show all the not-yet-accessed files in the directory, rather than only showing files after writing or accessing them?

A simple way is to run bin/alluxio fs loadMetadata /s3 to force a refresh of the Alluxio directory. There are other ways to trigger it; check out the "How to Trigger Metadata Sync" section in this blog:
https://www.alluxio.io/blog/metadata-synchronization-in-alluxio-design-implementation-and-optimization/
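For example, a minimal sketch (assuming a standard Alluxio installation with the CLI available under bin/) is to force the refresh and then list the mount point:
bin/alluxio fs loadMetadata /s3    # pull in metadata for objects Alluxio has not seen yet
bin/alluxio fs ls /s3              # the previously invisible objects should now appear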

Related

Downloading only new files in GoodData

How can I use the "Download File" component to only download new files or files that have been updated remotely?
Consider a graph containing a File Download component (screenshots of the graph and of the File Download configuration omitted).
I have many *.csv files in ${S3_OR_DATA_DIR_LOCATION} (I'm adding one every day).
How can I make sure GoodData only downloads new files AND files that have been updated? Would setting the "Overwrite existing files" option to False do it, or would that only download new files and not update existing files that have changed remotely?
The File Download CloudConnect component does not, by itself, support downloading only the new files that have appeared in the source folder, because it has no mechanism for remembering previous state. However, since it has an input port, you can implement such a mechanism yourself using the File List CloudConnect component, with a little help from Reformat, a Joiner, and CSV Writer components. This lets you determine the contents of the source folder and write them to a plain text state file. The next run can then read the state file from the previous run, determine which files are new, and send the list of new files to the File Download component's input port.
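The same state-file idea, sketched as plain shell commands rather than CloudConnect components, just to illustrate the logic (the bucket path and file names here are hypothetical):
aws s3 ls s3://my-bucket/source/ | awk '{print $4}' | sort > current_listing.txt    # snapshot of the source folder
comm -13 previous_listing.txt current_listing.txt > new_files.txt                   # names present now but missing from the previous run's state file
mv current_listing.txt previous_listing.txt                                         # save the state for the next run (create an empty previous_listing.txt before the first run)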
Another way to process only new files, which is much simpler than the approach described above and therefore commonly used, takes advantage of the folder structure in the source location: keep one dedicated folder for new files and another for files that have already been processed. The CloudConnect ETL process then reads new files from the dedicated source folder, and the last stage of the process uses a File Copy/Move CloudConnect component to transfer the files it has just processed from that folder into the folder containing all previously processed files.

Sync with S3 with s3cmd, but not re-download files that only changed name

I'm syncing a bunch of files between my computer and Amazon S3. Say a couple of the files change name, but their content is still the same. Do I have to have the local file removed by s3cmd and then the "new" file re-downloaded, just because it has a new name? Or is there any other way of checking for changes? I would like s3cmd to, in that case, simply change the name of the local file in accordance with the new name on the server.
s3cmd upstream (the master branch at github.com/s3tools/s3cmd) and the latest published version, 1.5.0-rc1, can figure this out, provided you used a recent version with the --preserve option to put the file into S3 in the first place, so that the md5sum of each file is stored. Using the md5sums, it knows that you already have a duplicate (even if renamed) file locally and won't re-download it; instead it will do a local copy (or hardlink) from the existing file system name to the name from S3.
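As a rough sketch of that workflow (bucket and directory names are hypothetical; check your s3cmd version's documentation for the exact options):
s3cmd sync --preserve /local/photos/ s3://my-bucket/photos/    # upload with attributes/md5sums stored, as described above
s3cmd sync s3://my-bucket/photos/ /local/photos/               # later download: a renamed-but-identical file is copied locally instead of re-downloaded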

Flowgear access to files on the local file system

I am creating a Flowgear workflow that needs to process a raft of XML data.
I have the XML data in a set of .xml files (approximately 400) in a folder on my local machine's hard drive, and I want to read them into a workflow, run an XSLT transform, and then write the resulting XML to another folder on the same drive.
How do I get the flowgear workflow to read these files?
It depends on the use case. The File Enumerator works exceptionally well to loop (as in a for-each) through each file. Sometimes you want to get a list of files in a particular folder and check whether a file has been found or not. For this, I would recommend a C# script that gets the list of files with code like:
Directory.GetFiles(@"{FilePath}", "*.{extension}", SearchOption.TopDirectoryOnly); // requires System.IO; {FilePath} and {extension} are placeholders for the folder path and file extension
Further on, use the File node to read, write, or delete files from a file directory.
NB! You will need to install a DropPoint on the PC/server to allow access to the files. For more information regarding DropPoints, see the Flowgear documentation.
You can use a File Enumerator or a File Watcher to read the files in. The difference is that a File Enumerator enumerates all files in a folder once, whereas the File Watcher watches a folder indefinitely and provides new files to the workflow as they are copied into the folder.
You can then use the File node to write the files back to the file system.

Is it possible to sync a single file to s3?

I'd like to sync a single file from my filesystem to s3.
Is this possible, or can only directories be synced?
Use the include/exclude options with the directory sync command:
e.g., to sync just /var/local/path/filename.xyz to S3, use:
aws s3 sync /var/local/path s3://bucket/path --exclude='*' --include='*/filename.xyz'
cp can be used to copy a single file to S3. If the filename already exists in the destination, this will replace it:
aws s3 cp local/path/to/file.js s3://bucket/path/to/file.js
Keep in mind that per the docs, sync will only make updates to the target if there have been file changes to the source file since the last run: s3 sync updates any files that have a size or modified time that are different from files with the same name at the destination. However, cp will always make updates to the target regardless of whether the source file has been modified.
Reference: AWS CLI Command Reference: cp
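A quick way to see that difference, reusing the paths from the cp example above:
aws s3 cp local/path/to/file.js s3://bucket/path/to/file.js    # re-uploads file.js on every run, changed or not
aws s3 sync local/path/to s3://bucket/path/to                  # a second run transfers nothing unless size or modified time has changed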
Just to comment on pythonjsgeo's answer: that seems to be the right solution, but make sure to execute the command without the = symbol after the include and exclude flags. I was including the = symbol and getting weird behavior with the sync command.
aws s3 sync /var/local/path s3://bucket/path --exclude '*' --include '*/filename.xyz'
You can mount the S3 bucket as a local folder (using RioFS, for example) and then use your favorite tool to synchronize the file(s) or directories.
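A rough sketch of that approach (the bucket name, mount point, and paths are hypothetical, and the exact RioFS invocation can vary by version, so check its README):
riofs my-bucket /mnt/s3                          # mount the bucket as a local folder (credentials configured per the RioFS docs)
rsync -av /local/path/file.js /mnt/s3/path/      # then copy or sync individual files with your favorite tool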

Empty files on S3 prevent from downloading using s3cmd and s3sync

I am trying to set up a backup/restore using S3. The upload sync worked well using s3sync. However, next to each folder there is an empty file with a matching name. I read somewhere that this is created to define the folder structure, but I'm not sure about that, since it doesn't happen when I create a folder using a different method (s3fox, etc.).
These empty files prevent me from restoring the directories/files. When I do s3cmd sync, I get the error "can not make directory: File exists", because it first creates that empty file, which then causes the directory creation to fail. Any ideas how I can solve this problem?