AWS CloudWatch Agent not uploading old files - amazon-cloudwatch

During the initial migration to AWS CloudWatch logging I also want legacy log files to be synced. However, it seems that only the current active file (i.e. still being updated) will be synced. The old files even match the file name format will be ignore.
So are there any easy way to upload legacy files?
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html

Short answer: you should be able to upload all files by merging them. Or create a new [logstream] section for each file.
Log files in /var/log are usually archived periodically, for instance by logrotate. If the current active file is named abcd.log, then after a few days files will be created automatically with names like abcd.log.1, abcd.log.2...
Depending on your exact system and configuration, they can also be compressed automatically (abcd.log.1.gz,abcd.log.1.gz, ...).
The CloudWatch Logs documentation defines the file configuration parameter as such:
file
Specifies log files that you want to push to CloudWatch Logs. File can point to a specific file or multiple files (using wildcards such as /var/log/system.log*). Only the latest file is pushed to CloudWatch Logs based on file modification time.
Note : using a glob path with a star (*) will therefore not be sufficient to upload historical files.
Assuming that you have already configured a glob path, you could use the touch command sequentially on each of the historical files to trigger their upload. Problems :
you would need to guess when the CloudWatch agent has noticed each file before proceeding to the next
you would need to temporarily pause the current active file
zipped files are not supported, but you can decompress them manually
Alternatively you could decompress then aggregate all historical files in a single merged file. In the context of the first example, you could run cat abcd.log.* > abcd.log.merged. This newly created file would be detected by the CloudWatch agent (matches the glob pattern) which would consider it as the active file. Problem : the previous active file could be updated simultaneously and take the lead before CloudWatch notices your merged file. If this is a concern, you could simply create a new [logstream] config section dedication the historical file.
Alternatively, just decompress the historical files then create a new [logstream] config section for each.
Please correct any bad assumptions that I made about your system.

Related

Apache Camel eats S3 "folders" created programatically, but not ones created in AWS S3 Console

We have an Apache Camel app that is supposed to read files in a certain directory structure in S3, process the files (generating some metadata based on the folder the file is in), submit the data in the file (and metadata) to another system and finally put the consumed files into a different bucket, deleting the original from the incoming bucket.
The behaviour I'm seeing is that when I programatically create the directory structure in S3, those "folders" are being consumed, so the dir structure disappears.
I know S3 technically does not have folders, just empty files ending in /.
The twist here is that any "folder" created in the S3 Console, are NOT consumed, they stay there as we want them to. Any folders created via AWS CLI, or boto3 are immediately consumed.
The problem is that we do need the folders to be created with automation, there are too many to do by hand.
I've reached out to AWS Support, and they just tell me that there are no differences between how the Console creates folders, and how the CLI does it. Support confirmed that the command I used in CLI is correct.
I think my issue is similar to Apache Camel deleting AWS S3 bucket's folder , but that has no answer...
How can I get Camel to not "eat" any folders?

syncing files with aws s3 sync that have a minimum timestamp

I am syncing a directory to an s3 bucket. It is a directory, so I only want it to check for files that were created/updated in the last 24 hours.
With GNU/Linux's rsync, you might do this by piping the output of 'find -mtime' to rsync; I'm wondering if anything like this is possible with aws s3 sync?
Edited to show final goal: I'm running a script that constantly syncs files to S3 from a web server. It runs every minute, first checks if there is already a process running (and exits if it does), then runs the aws sync command. The sync command takes about 5 minutes to run and usually gets 3-5 new files. This causes a slight load on the system, and I think if I just checked for files in the last 24 hours, it'd be much much faster.
No, the AWS Command-Line Interface (CLI) aws s3 sync command does not have an option to only include files created within a defined time period.
See: aws s3 sync documentation
It sounds like most of your time is being consumed by the check of whether files need to be updated. Some options:
If you don't need all the files locally, you could delete them after some time (48 hours?). This means less files will need to be compared. By default, aws s3 sync will not delete destination files that do not match a local file (but this can be configured via a flag).
You could copy recent files (past 24 hours?) into a different directory and run aws s3 sync from that directory. Then, clear out those files after a successful sync run.
If you have flexibility over the filenames, you could include the date in the filename (eg 2018-03-13-foo.txt) and then use --include and --exclude parameters to only copy files with desired prefixes.

SFTP file locking semantics [duplicate]

How can I make sure that a file uploaded through SFTP (in a Linux base system) stays locked during the transfer so an automated system will not read it?
Is there an option on the client side? Or server side?
SFTP protocol supports locking since version 5. See the specification.
You didn't specify, what SFTP server are you using. So I'm assuming the most widespread one, the OpenSSH. The OpenSSH supports SFTP version 3 only, so it does not support locking.
Anyway, even if your server supported file locking, most SFTP clients/libraries won't support SFTP version 5. Or even if they do, they won't support the locking feature. Note that the lock is explicit, the client has to request it.
There are some common workarounds for the problem:
As suggested by #user1717259, you can have the client upload a "done" file, once an upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after an upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for example of implementing this approach.
Also, some SFTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive (courtesy of #fakedad).
A gross hack is to periodically check for file attributes (size and time) and consider the upload finished, if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have clear end-of-the-file marker (like XML or ZIP). So you know, when you download an incomplete file.
A typical way of solving this problem is to upload your real file, and then to upload an empty 'done.txt' file.
The automated system should wait for the appearance of the 'done' file before trying to read the real file.
A simple file locking mechanism for SFTP is to first upload a file to a directory (folder) where the read process isn't looking. You can "make" an alternate folder using the sftp> mkdir command. Upload the file to the alternate directory, instead of the ultimate destination directory. Once the SFTP> put command completes, then do a move like this:
SFTP> move alternate_path/filename destination_path/filename. Since the SFTP "move" is just switching the file pointers, it is atomic, so it is an effective lock.

How to detect that a file is being uploaded over FTP

My application is keeping watch on a set of folders where users can upload files. When a file upload is finished I have to apply a treatment, but I don't know how to detect that a file has not finish to upload.
Any way to detect if a file is not released yet by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file being uploaded, preventing you from accessing it, while the file is still being uploaded. For example IIS FTP server does that. Most other FTP servers do not. See my answer at Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for the FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check for file attributes (size and time) and consider the upload finished, if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have clear end-of-the-file marker (like XML or ZIP). So you know, that the file is incomplete.
Some FTP servers allow you to configure a hook to be called, when an upload is finished. You can make use of that. For example ProFTPD has a mod_exec module (see the ExecOnCommand directive).
I use ftputil to implement this work-around:
connect to ftp server
list all files of the directory
call stat() on each file
wait N seconds
For each file: call stat() again. If result is different, then skip this file, since it was modified during the last seconds.
If stat() result is not different, then download the file.
This whole ftp-fetching is old and obsolete technology. I hope that the customer will use a modern http API the next time :-)
If you are reading files of particular extensions, then use WINSCP for File Transfer. It will create a temporary file with extension .filepart and it will turn to the actual file extension once it fully transfer the file.
I hope, it will help someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send a file, then send a second short "marker" file just to tell the recipient the transfer of the first is complete. You can use a file naming convention and just check for existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have the problem with the second file because you just wait until file size = checksum size).
And of course this only works if you can get the sender to send a second file.

SFTP file lock mechanism

How can I make sure that a file uploaded through SFTP (in a Linux base system) stays locked during the transfer so an automated system will not read it?
Is there an option on the client side? Or server side?
SFTP protocol supports locking since version 5. See the specification.
You didn't specify, what SFTP server are you using. So I'm assuming the most widespread one, the OpenSSH. The OpenSSH supports SFTP version 3 only, so it does not support locking.
Anyway, even if your server supported file locking, most SFTP clients/libraries won't support SFTP version 5. Or even if they do, they won't support the locking feature. Note that the lock is explicit, the client has to request it.
There are some common workarounds for the problem:
As suggested by #user1717259, you can have the client upload a "done" file, once an upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after an upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for example of implementing this approach.
Also, some SFTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive (courtesy of #fakedad).
A gross hack is to periodically check for file attributes (size and time) and consider the upload finished, if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have clear end-of-the-file marker (like XML or ZIP). So you know, when you download an incomplete file.
A typical way of solving this problem is to upload your real file, and then to upload an empty 'done.txt' file.
The automated system should wait for the appearance of the 'done' file before trying to read the real file.
A simple file locking mechanism for SFTP is to first upload a file to a directory (folder) where the read process isn't looking. You can "make" an alternate folder using the sftp> mkdir command. Upload the file to the alternate directory, instead of the ultimate destination directory. Once the SFTP> put command completes, then do a move like this:
SFTP> move alternate_path/filename destination_path/filename. Since the SFTP "move" is just switching the file pointers, it is atomic, so it is an effective lock.