I have several Node.js applications running on ECS Fargate, and their logs are shipped to CloudWatch. I'd like custom fields to show up in CloudWatch for each log message, such as application.name and application.version, and possibly fields derived from the content of the log message. Say my log message is [ERROR] Prototype: flabebes-flower crashed; I'd like to pull out the log level ERROR and the name of the prototype flabebes-flower.
Is it possible to have these fields in CloudWatch? If so, how can I accomplish this? I know how to achieve this using Filebeat processors and shipping the logs to Elasticsearch, and I have that solution already, but I'd like to explore moving away from Elasticsearch and using only CloudWatch, without having to write my own parsers.
There are basically two options:
If your log messages always have the same format, you can use the parse feature of CloudWatch Logs Insights to extract these fields, e.g.,
parse @message "[*] Prototype: * crashed" as level, prototype
If the metadata that you want to extract into custom fields is not in a parsable format, you can configure your application to log in JSON format and add the metadata to the JSON log within your application (how depends on the logging library that you use). Your JSON format can then look something like this:
{"prototype":"flabebes-flower","level":"error","message":"[ERROR] Prototype: flabebes-flower crashed","timestamp":"27.05.2022, 18:09:46"}
Again with CloudWatch Logs Insights, you can access the custom fields prototype and level. CloudWatch will automatically parse the JSON for you; there is no need to use the parse command as in the first method.
This allows you, e.g., to run the query
fields @timestamp, message
| filter level = "error"
to get all error messages.
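If you prefer to run such a query programmatically rather than in the console, boto3 exposes Logs Insights via start_query and get_query_results. A minimal sketch, assuming boto3 is configured and /ecs/my-node-app stands in for your ECS task's log group:
import time
import boto3

logs = boto3.client("logs")

# Start the Logs Insights query over the last hour of the (placeholder) log group.
query_id = logs.start_query(
    logGroupName="/ecs/my-node-app",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString='fields @timestamp, message | filter level = "error"',
)["queryId"]

# Poll until the query finishes, then print the matching rows as dicts.
response = logs.get_query_results(queryId=query_id)
while response["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    response = logs.get_query_results(queryId=query_id)

for row in response["results"]:
    print({cell["field"]: cell["value"] for cell in row})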
I'm trying to get the duration and resolution of a video that I stored in S3 but I'm having some issues.
I tried it using MediaInfo and the process described in Extracting Video Metadata using Lambda and Mediainfo | AWS Compute Blog, but I get an access denied error when the Lambda function fires. Later I tried some workarounds for that, but they just kept returning errors of many types. Finally I tried the node-mediainfo package, but it returns a timeout error.
Is there another way to do it?
Is there any other way to insert data into BigQuery via the API apart from streaming it, i.e. Table.insertAll?
InsertAllResponse response = bigquery.insertAll(
    InsertAllRequest.newBuilder(tableId)
        .addRow("rowId", rowContent)
        .build());
As you can see in the docs, you also have two other possibilities:
Loading from Google Cloud Storage, Bigtable, or Datastore
Just run the jobs.insert method of the jobs resource and set the configuration.load.sourceUri field in the job metadata.
In the Python client, this is done with LoadTableFromStorageJob.
You can therefore just send your files to GCS first, for instance, and then make an API call to load them into BigQuery.
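With the current google-cloud-bigquery Python client (LoadTableFromStorageJob is the older client's name for this), a GCS load might look roughly like the sketch below; the bucket, file, and table names are placeholders:
from google.cloud import bigquery

client = bigquery.Client()

# Describe the load job; here we assume newline-delimited JSON as the source format.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# The GCS URI plays the role of configuration.load.sourceUri.
load_job = client.load_table_from_uri(
    "gs://my-bucket/my-data.json",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish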
Media Upload
This is also a load job, but this time the HTTP request also carries the binary content of a file on your machine, so you can send pretty much any file on your disk with this request (as long as the format is accepted by BQ).
In the Python client, this is done with Table.upload_from_file.
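Again as a hedged sketch with the current Python client (Table.upload_from_file is the older name), a local file upload could look like this; the CSV path and table name are placeholders:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assumes the CSV has a header row
)

# The request body carries the file's bytes directly ("media upload").
with open("local_data.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "my_project.my_dataset.my_table",
        job_config=job_config,
    )
load_job.result()  # wait for the load job to finish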
I'm using Fine Uploader with S3 and I have a client whose computer time is off, resulting in an S3 RequestTimeTooSkewed error. Ideally, my client would have the right time, but I'd like to have my app be robust to this situation.
I've seen this post (https://github.com/aws/aws-sdk-js/issues/399) on how to automatically retry the request: you take the ServerTime from the error response and use it as the time in the retried request. An alternative approach would be to get the time from a reliable external source every time, avoiding the need for a retry. However, I'm not sure how to hook either approach into S3 Fine Uploader. Does anyone have an idea of how to do this?
A solution was provided in Fine Uploader 5.5 to address this very situation. From the S3 feature documentation:
If the clock on the machine running Fine Uploader is too far off of the current date, S3 may reject any requests sent from this machine. To overcome this situation, you can include a clock drift value, in milliseconds, when creating a new Fine Uploader instance. One way to set this value is to subtract the current time according to the browser from the current unix time according to your server. For example:
var uploader = new qq.s3.FineUploader({
    request: {
        clockDrift: SERVER_UNIX_TIME_IN_MS - Date.now()
    }
});
If this value is non-zero, Fine Uploader S3 will use it to pad the x-amz-date header and the policy expiration date sent to S3.
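As for where SERVER_UNIX_TIME_IN_MS comes from: one option is a tiny endpoint on your own backend that returns the server's unix time in milliseconds, which the page fetches before constructing the uploader. A minimal sketch, assuming a Python/Flask backend (purely illustrative; Fine Uploader itself does not require this):
import time

from flask import Flask

app = Flask(__name__)

# Returns the server's current unix time in milliseconds as plain text.
@app.route("/server-time")
def server_time():
    return str(int(time.time() * 1000))

if __name__ == "__main__":
    app.run()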
I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId
or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere, here and on the web, and tried many things, but apparently S3 has been changing recently, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …; it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
Yes, you have to use s3n instead of s3. The s3 scheme is an odd block-based layering on top of S3, and its benefits are unclear to me.
You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:
rdd = sc.hadoopFile('s3n://my_bucket/my_file',
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text',
                    conf={
                        'fs.s3n.awsAccessKeyId': '...',
                        'fs.s3n.awsSecretAccessKey': '...',
                    })
The problem was actually a bug in Amazon's boto Python module. It was related to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and can have serious bugs that are very easy to run into. For the first problem, I would recommend first updating the AWS command line interface, boto, and Spark whenever something strange happens: this has "magically" solved a few issues for me already.
Here is a solution for reading the credentials from ~/.aws/credentials. It makes use of the fact that the credentials file is an INI file, which can be parsed with Python's configparser.
import os
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile = 'default' # your AWS profile to use
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
See also my gist at https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 .
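To then hand these values to Spark, one hedged option (assuming the s3n connector used elsewhere in this thread) is to set them on the SparkContext's Hadoop configuration; note that _jsc is a PySpark-internal handle:
# Assumes `sc` is your SparkContext and that access_id / access_key come from the
# configparser snippet above (both assumptions, not part of the original answer).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)

rdd = sc.textFile("s3n://my_bucket/my_file")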
Setting environment variables could help.
In the Spark FAQ, under the question "How can I access data in S3?", they suggest setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
I cannot say much about the Java objects you have to give to the hadoopFile function, only that this function already seems deprecated in favour of some "newAPIHadoopFile". The documentation on this is quite sketchy, and I feel like you need to know Scala/Java to really get to the bottom of what everything means.
In the meantime, I figured out how to actually get some S3 data into pyspark and I thought I would share my findings.
The Spark API documentation says that it uses a dict that gets converted into a Java configuration (XML). I found the configuration for Java; it should reflect the values you should put into the dict: How to access S3/S3n from local hadoop installation
bucket = "mycompany-mydata-bucket"
prefix = "2015/04/04/mybiglogfile.log.gz"
filename = "s3n://{}/{}".format(bucket, prefix)
config_dict = {"fs.s3n.awsAccessKeyId":"FOOBAR",
"fs.s3n.awsSecretAccessKey":"BARFOO"}
rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',  # key: byte offset of each line
                    'org.apache.hadoop.io.Text',          # value: the line itself
                    conf=config_dict)
This code snippet loads the file from the bucket and prefix (file path in the bucket) specified on the first two lines.
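As a usage note on the sketch above: with TextInputFormat each RDD element is a (byte offset, line) pair, so a quick way to peek at the data is to drop the keys, e.g.:
# Print the first few lines of the file, ignoring the byte-offset keys.
for line in rdd.map(lambda kv: kv[1]).take(5):
    print(line)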