syncing files with aws s3 sync that have a minimum timestamp

I am syncing a directory to an S3 bucket, and I only want it to check for files that were created or updated in the last 24 hours.
With GNU/Linux's rsync, you might do this by piping the output of 'find -mtime' to rsync; I'm wondering if anything like this is possible with aws s3 sync?
Edited to show the final goal: I'm running a script that constantly syncs files to S3 from a web server. It runs every minute, first checks whether a sync process is already running (and exits if so), then runs the aws s3 sync command. The sync command takes about 5 minutes to run and usually picks up 3-5 new files. This causes a slight load on the system, and I think that if it only checked for files from the last 24 hours it would be much faster.
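For reference, a minimal sketch of the wrapper script I'm describing (the lock file path, source directory, and bucket name are placeholders, not my real ones):

    #!/bin/bash
    # Exit immediately if another sync is still running, otherwise run the sync.
    exec 9>/var/lock/s3sync.lock
    flock -n 9 || exit 0
    aws s3 sync /var/www/uploads s3://example-bucket/uploads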

No, the AWS Command-Line Interface (CLI) aws s3 sync command does not have an option to only include files created within a defined time period.
See: aws s3 sync documentation
It sounds like most of your time is being consumed by checking whether files need to be updated. Some options:
If you don't need all the files locally, you could delete them after some time (48 hours?). This means fewer files will need to be compared. By default, aws s3 sync will not delete destination files that do not match a local file (but this can be configured via a flag).
You could copy recent files (past 24 hours?) into a different directory and run aws s3 sync from that directory. Then, clear out those files after a successful sync run.
If you have flexibility over the filenames, you could include the date in the filename (e.g. 2018-03-13-foo.txt) and then use the --include and --exclude parameters to only copy files with desired prefixes (see the sketches below).
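Rough sketches of these options, assuming a hypothetical source directory /var/www/uploads and bucket example-bucket:

    # Option 1: prune local files older than 48 hours so there is less to compare
    find /var/www/uploads -type f -mtime +2 -delete

    # Option 2: stage files from the last 24 hours, sync the staging directory, then clear it
    STAGING=/tmp/s3-staging
    mkdir -p "$STAGING"
    (cd /var/www/uploads && find . -type f -mtime -1 -exec cp --parents {} "$STAGING" \;)
    aws s3 sync "$STAGING" s3://example-bucket/uploads && rm -rf "$STAGING"

    # Option 3: date-prefixed filenames copied with include/exclude filters
    aws s3 sync /var/www/uploads s3://example-bucket/uploads --exclude "*" --include "2018-03-13-*"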

Related

How to automatically transfer a file from one server to another every time the file has changed

I have a Windows 2003 server with FileZilla installed on it. There are some folders with files that need to be uploaded to AWS S3 every time any file changes, but it needs to be automatic.
Does anybody have any suggestions?
I don't think you can use FileZilla for that.
You could create a Windows scheduled task to run aws s3 sync on a recurring schedule. The file wouldn't copy to S3 immediately, but it would be automatic.
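As a rough sketch (the task name, folder, and bucket are placeholders), such a task can be created from a command prompt with schtasks:

    schtasks /Create /TN "S3Sync" /SC MINUTE /MO 15 /TR "aws s3 sync C:\data\outgoing s3://example-bucket/outgoing"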

AWS CloudWatch Agent not uploading old files

During the initial migration to AWS CloudWatch logging I also want legacy log files to be synced. However, it seems that only the current active file (i.e. the one still being updated) is synced. The old files are ignored even though they match the file name format.
So is there any easy way to upload the legacy files?
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
Short answer: you should be able to upload all files by merging them. Or create a new [logstream] section for each file.
Log files in /var/log are usually archived periodically, for instance by logrotate. If the current active file is named abcd.log, then after a few days files will be created automatically with names like abcd.log.1, abcd.log.2...
Depending on your exact system and configuration, they can also be compressed automatically (abcd.log.1.gz, abcd.log.2.gz, ...).
The CloudWatch Logs documentation defines the file configuration parameter as such:
file
Specifies log files that you want to push to CloudWatch Logs. File can point to a specific file or multiple files (using wildcards such as /var/log/system.log*). Only the latest file is pushed to CloudWatch Logs based on file modification time.
Note: using a glob path with a star (*) will therefore not be sufficient to upload historical files.
Assuming that you have already configured a glob path, you could use the touch command sequentially on each of the historical files to trigger their upload (a rough sketch follows the list). Problems:
you would need to guess when the CloudWatch agent has noticed each file before proceeding to the next
you would need to temporarily pause writes to the current active file
zipped files are not supported, but you can decompress them manually
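A rough sketch of that touch loop (it assumes the files are already decompressed, and the pause is an arbitrary guess you would have to tune for your agent):

    # Bump modification times one file at a time so the agent picks each one up.
    for f in /var/log/abcd.log.*; do
        touch "$f"
        sleep 60   # rough guess at how long the agent needs to notice the file
    done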
Alternatively, you could decompress and then aggregate all the historical files into a single merged file. In the context of the first example, you could run cat abcd.log.* > abcd.log.merged. This newly created file would be detected by the CloudWatch agent (it matches the glob pattern), which would consider it the active file. Problem: the previous active file could be updated simultaneously and take the lead before CloudWatch notices your merged file. If this is a concern, you could simply create a new [logstream] config section dedicated to the historical file.
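A rough sketch of that decompress-and-merge step, following the abcd.log example (decompress first, since the agent does not read gzipped files):

    gunzip /var/log/abcd.log.*.gz
    # Write the merged file under a temporary name so a re-run does not pick it up in the glob.
    cat /var/log/abcd.log.* > /tmp/abcd.log.merged
    mv /tmp/abcd.log.merged /var/log/abcd.log.merged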
Alternatively, just decompress the historical files then create a new [logstream] config section for each.
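A [logstream] section for one historical file might look like the following (the section name, log group, stream name, and datetime_format are placeholders for your own setup):

    [abcd-historical-1]
    file = /var/log/abcd.log.1
    log_group_name = example-log-group
    log_stream_name = abcd-historical-1
    datetime_format = %b %d %H:%M:%S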
Please correct any bad assumptions that I made about your system.

How to copy an S3 bucket onto Kubernetes nodes

I want to copy an S3 bucket onto Kubernetes nodes as a DaemonSet, so that a new node also gets a copy of the S3 bucket as soon as it is launched.
I prefer copying S3 to the Kubernetes node because copying from S3 directly into each pod via the AWS API would mean multiple calls (since multiple pods require the data), and it would take time to copy the content every time a pod launches.
Assuming that your S3 content is static and doesn't change often, I believe it makes more sense to use a one-time Job than a DaemonSet to copy the whole S3 bucket to the local disk. It's not clear how you would signal the kube-scheduler that your node is not ready until the S3 bucket is fully copied, but perhaps you can taint your node before the job finishes and remove the taint afterwards.
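A rough sketch of that taint approach (the node name, taint key, target directory, and bucket are all placeholders):

    # Keep ordinary pods off the node until the copy has finished.
    kubectl taint nodes my-node s3-copy=pending:NoSchedule
    # Run the copy (this would normally happen inside the Job's pod).
    aws s3 sync s3://example-bucket /data/s3-copy
    # Remove the taint once the copy is complete.
    kubectl taint nodes my-node s3-copy=pending:NoSchedule-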
Note also that S3 is inherently slow and meant to be used for processing (reading/writing) single files at a time, so if your bucket has a large amount of data it would take a long time to copy to the node disk.
If your S3 content is dynamic (constantly changing), then it would be more challenging, since you would have to keep the files in sync. Your apps would probably need a cache architecture where they look for files on the local disk first and, if the files are not there, make a request to S3.

How to filter or clean up my S3 bucket cluttered by log files?

I use S3 and Amazon CloudFront to serve images.
When I go to the Amazon S3 interface, it's hard to find the folder where I have put my images because I need to scroll for 10 minutes past all the buckets it creates every 15 minutes/hour. There are literally thousands.
Is it normal?
Did I put something wrong on the settings of S3 or of the cloud front file I connected to this S3 folder?
What should I do to delete them? It seems I can only delete them one by one.
A snapshot of the listing shows these entries going on and on, for thousands of files.
Those are not buckets, but are actually log files generated by S3 because you enabled logging for your bucket and configured it to save the logs in the same bucket.
If you want to keep logging enabled but make it easier to work with the logs, just use a prefix in the logging configuration or set up logging to use a different bucket.
If you don't need the logs, just disable logging.
See http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html for more details.
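For example, using the AWS CLI (the bucket names and log prefix below are placeholders; the prefix of the existing log objects in particular depends on how logging was configured):

    # Keep logging, but send future logs to a separate bucket under a prefix
    # (the target bucket needs log-delivery write permissions).
    aws s3api put-bucket-logging --bucket example-bucket --bucket-logging-status \
        '{"LoggingEnabled":{"TargetBucket":"example-log-bucket","TargetPrefix":"logs/"}}'

    # Or disable server access logging entirely.
    aws s3api put-bucket-logging --bucket example-bucket --bucket-logging-status '{}'

    # Bulk-delete the existing log objects by prefix instead of clicking one by one.
    aws s3 rm s3://example-bucket/ --recursive --exclude "*" --include "2017-*"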

AWS EC2 - Syncing source code files with S3 - is it a proper approach?

On an app server where a few source files change frequently, is the following approach recommended?
Use a cron job with S3tools to sync the source files with a private S3 bucket (every 15 minutes, for example).
On server start-up, use a user data script to sync from the sources bucket and retrieve the latest sources.
Advantages:
1. No need to attach EBS for app server just to save a few files
2. Similar setup to all app servers
3. Sources automatically backed up.
4. As a byproduct, distributes code to multiple app servers automatically.
Disadvantages:
keeping source code on S3
other?
What do you think about this methodology? Is this the right way to use EC2 when source code changes frequently (a few times a day)? Please recommend the best approach for running EC2 instances where the sources change often.
I think you're better off using a proper source code repository, like Subversion or Git, rather than storing the source files on S3. That way you can have a central location for the source files while avoiding the update consistency problems that kdgregory mentioned.
You can put the source repository on one of your own servers outside of EC2, or host it on an EC2 instance (make sure the repository files are on an EBS volume in the latter case).
If you're going to be running a large number of EC2 instances, then it will be less effort to have them sync themselves from a central location (i.e., you sync to a private bucket and the app-servers sync from that bucket).
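A rough sketch with s3cmd (the bucket and paths are placeholders): the staging machine pushes to the bucket and each app-server pulls from it on its own schedule.

    # On the staging/build machine (cron):
    s3cmd sync /srv/build/ s3://example-sources-bucket/app/
    # On each app server (cron or start-up script):
    s3cmd sync s3://example-sources-bucket/app/ /srv/app/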
HOWEVER, recognize that updates to an S3 bucket are atomic only at the object level, and more importantly, are not guaranteed to be immediately consistent (although I recall seeing a recent note that the us-west endpoint does offer read-after-write consistency).
This means that your app-servers may load a set of new files that are internally inconsistent -- some will be old, some will be new. If this is a problem for you, then you should implement a scheme that uploads directly to the app-servers, and ensures changeset consistency (perhaps by uploading to a temporary directory that is then renamed).
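A rough sketch of that direct-upload scheme (host names and paths are hypothetical): push the changeset into a staging directory on the app-server, then swap it into place with a rename so the application never sees a half-updated tree.

    # Push the whole changeset to a staging directory on the app server.
    rsync -a /srv/build/ appserver:/srv/app-staging/
    # Then swap it into place; mv within the same filesystem is effectively a rename.
    ssh appserver 'mv /srv/app /srv/app-previous && mv /srv/app-staging /srv/app'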