Adding fields to CloudWatch without using JSON - amazon-cloudwatch

So I have typical run-of-the-mill logs from Nginx and Tomcat servers: single-line text files in a typical log format. I have changed the Tomcat access logs to output pipe-delimited fields so I can process them easily with some Unix scripts. I'd like to get rid of those scripts and move to CloudWatch to process my logs in a similar manner; however, I found out that by default CloudWatch really doesn't understand anything beyond timestamp, message, and log stream.
It will add fields using JSON, but JSON is verbose when it comes to log files. I'd like to just let it process a CSV file, which seems like an obvious alternative to JSON. I'm willing to change my log format to meet a requirement like that, but I can't find any information about how I could do so.
Is my only option to translate my logs into JSON in order to add fields to CloudWatch? I am aware of the parse command, but I find it cumbersome to reconstitute my fields every time I want to build a query, especially since these will mostly be access logs with numerous fields. I have the AWS CloudWatch Logs agent set up on my systems and am currently sending these logs to CloudWatch.
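For context, this is the kind of Logs Insights parse query I mean; the delimiter and field names here are just examples from my own pipe-delimited format, and I'd have to repeat this in every query:

```
fields @timestamp
| parse @message "*|*|*|*|*" as client_ip, method, uri, status, bytes
| filter status = "404"
| sort @timestamp desc
```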

The closest CloudWatch comes to handling space-delimited log files is Metric Filters; at least, that's how the authors of CloudWatch designed it.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html
The best examples of this are here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CountOccurrencesExample.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/ExtractBytesExample.html
I'm not sure if this is going to work for what I'm trying to do with my logs, but it's a start, and it's the closest thing to a proper answer. If you want it done right, you gotta do it yo'self.
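For example, following the ExtractBytesExample doc above, a metric filter over space-delimited access-log fields can be created roughly like this (the log group, metric names, and field layout are illustrative, and the command needs AWS credentials and an existing log group to actually run):

```shell
aws logs put-metric-filter \
  --log-group-name my-tomcat-access \
  --filter-name BytesByStatus404 \
  --filter-pattern '[ip, user, username, timestamp, request, status_code=404, bytes]' \
  --metric-transformations \
      metricName=BytesTransferred404,metricNamespace=AccessLogs,metricValue='$bytes'
```

Note this gives you metrics derived from the fields, not queryable fields on the log events themselves.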

Related

Dask dataframe read parquet format fails from http

I have been dealing with this problem for a week.
I use the command
from dask import dataframe as ddf
ddf.read_parquet("http://IP:port/webhdfs/v1/user/...")
and I get an "invalid parquet magic" error.
However, ddf.read_parquet is OK with "webhdfs://".
I would like ddf.read_parquet to work over HTTP because I want to use it in a dask-ssh cluster whose workers have no HDFS access.
Although the comments already partly answer this question, I thought I would add some information as an answer.
HTTP(S) is supported by dask (actually by fsspec) as a backend filesystem; but to get partitioning within a file, you need the size of that file, and to resolve globs, you need to be able to get a list of links, neither of which is necessarily provided by any given server.
webHDFS (or indeed httpFS) doesn't work like plain HTTP downloads: you need to use a specific API to open a file and fetch a final URL on a cluster member for that file, so the two methods are not interchangeable.
webHDFS is normally intended for use outside of the Hadoop cluster; within the cluster, you would probably use plain HDFS ("hdfs://"). Note that Kerberos-secured webHDFS can be tricky to work with, depending on how the security was set up.
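The partitioning point can be illustrated with a local file (a sketch, nothing dask-specific): to split a file into chunks, the reader must first know the total size, which is exactly what a bare HTTP server may not expose, and then fetch each chunk with a ranged read, the local equivalent of an HTTP Range request.

```python
import os
import tempfile

# A small local file stands in for a remote parquet object.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as f:
    f.write("x" * 1000)
    path = f.name

# Partitioning needs the total size up front...
size = os.stat(path).st_size

# ...so the reader can compute byte ranges, one per partition.
n_parts = 4
chunk = size // n_parts
offsets = [
    (i * chunk, size if i == n_parts - 1 else (i + 1) * chunk)
    for i in range(n_parts)
]

# Ranged reads fetch each partition independently.
with open(path, "rb") as f:
    parts = []
    for start, end in offsets:
        f.seek(start)
        parts.append(f.read(end - start))

os.remove(path)
```

Without the size, none of the ranges can be computed, so the whole-file read is the only option.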

Does Informatica Powercenter provide API to access session logs

Question - Does Informatica PowerCenter provide an API to access session logs? I believe not, but I wanted to ask the forum to be sure.
Objective - I want to extract session logs, process them through Logstash, and perform reactive analytics periodically.
Alternate - The same could be solved using a Logstash input plugin for Informatica, but I did not find that either.
Usage - This will be used to determine common causes, analyze cache usage at the session level, throughput, and any performance bottlenecks.
You can call Informatica Web Services' getSessionLog operation. Here's a sample blog post with details: http://www.kpipartners.com/blog/bid/157919/Accessing-Informatica-Web-Services-from-3rd-Party-Apps
I suppose that the correct answer is 'yes', since there is a command-line tool to convert log files to txt or even xml format.
The tool for session/workflow logs is called infacmd, with the 'getsessionlog' argument. You can look it up in the help section of your PowerCenter client or here:
https://kb.informatica.com/proddocs/Product%20Documentation/5/IN_101_CommandReference_en.pdf
That has always been enough for my needs.
But there is more to look into: when you run this command-line tool (which is really a BAT file), a java.exe does the bulk of the processing in a sub-process. The jar files used by that process could potentially be used by somebody else directly, but I don't know if that has been documented anywhere publicly.
Perhaps someone else knows the answer to that.

How to add more data to be stored in the Jenkins REST API

To make the question simple: I know that I can get some build information with https://jenkins_server/.../api/json|xml|python, and I get a lot of information for that build record.
However, I want to add more information to that build record: for example, the Docker image created, or the tickets or files changed since the last build (to create release notes), etc. How do I do that?
For now, I use a script to create a JSON file as an artifact and fetch that JSON file to get this information, but it seems duplicative when I could perhaps add more data to the Jenkins build object directly.
The Jenkins remote access API is designed to provide access to generic Jenkins-internal information, like build numbers, timestamps, fingerprints etc.
If you want to add your own data there, then you must extend Jenkins accordingly, e.g., by designing a plugin that advertises your (custom) information items as standard Jenkins-"internal" data. If you want to do that, you may want to have a look at the way fingerprint information is handled (I found that quite instructive).
However, I'd recommend that you stick with your current approach, and keep generic Jenkins-internal information separated from Job-specific data. It is less effort and clearly separates your own data from Jenkins' data.
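The artifact approach recommended above can be sketched like this (a minimal example, not Jenkins' own API; the field names and artifact file name are my own choices, mirroring the extra data from the question):

```python
import json
import os

# Extra per-build data that Jenkins itself doesn't track.
build_info = {
    "build_number": int(os.environ.get("BUILD_NUMBER", "0")),
    "docker_image": "registry.example.com/myapp:1.2.3",  # hypothetical value
    "tickets": ["PROJ-101", "PROJ-205"],                 # hypothetical values
}

# Written into the workspace and archived as a build artifact, the file
# becomes reachable under the build URL at .../artifact/build-info.json.
with open("build-info.json", "w") as f:
    json.dump(build_info, f, indent=2)

# Consumers fetch the artifact and read it back alongside the standard API data.
with open("build-info.json") as f:
    info = json.load(f)

os.remove("build-info.json")  # cleanup for this self-contained sketch
```

The script runs as a build step; the only Jenkins-specific parts are the BUILD_NUMBER environment variable and the artifact-archiving step that publishes the file.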

errors in transformation Kettle

I want to get errors generated by the system in Pentaho Kettle and expose them as results in a transformation or job. For example, I want to take errors of the HL7 input step from the log and expose them as results in the next step.
I want to get errors generated by system
You mean like Apache or MySQL errors? If that's the case, you can just point a Pentaho transformation at those files. They usually have a default place like /var/log/apache2, and that would be pretty easy to read.
The part that's not so easy is parsing those errors into something easier to analyse. For that I would use a "load file in memory" step and some "regex evaluation" steps to get the data you want out of the raw text.
But, there are better solutions for reading your logs and analyzing errors.
See LogStash for more info or similar products.
You could save those results in a temporary CSV file that the next step(s) can consume.
If you go with this solution I would recommend:
Adding a unique job ID or identifier to the file name, to ensure that your next step reads the right file.
Adding a step at the end that removes old temp files.
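The temp-file handoff above can be sketched like this (plain Python rather than Kettle steps; the file-naming scheme and the one-day age cutoff are illustrative choices, not Kettle defaults):

```python
import csv
import glob
import os
import tempfile
import time

job_id = "job-20240101-001"  # unique per run, e.g. taken from a job variable
tmp_dir = tempfile.gettempdir()
err_file = os.path.join(tmp_dir, f"errors_{job_id}.csv")

# Step 1: the failing step writes its errors, tagged with the job id.
with open(err_file, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["step", "error"])
    writer.writerow(["HL7 input", "invalid segment"])

# Step 2: the next step reads only *its* job's file, never a stale one.
with open(os.path.join(tmp_dir, f"errors_{job_id}.csv"), newline="") as f:
    rows = list(csv.reader(f))

# Step 3: a final cleanup step removes temp files older than a day;
# the file just written is recent, so it survives this pass.
cutoff = time.time() - 24 * 3600
for old in glob.glob(os.path.join(tmp_dir, "errors_*.csv")):
    if os.path.getmtime(old) < cutoff:
        os.remove(old)
```

In Kettle the same pattern would use a Text file output step, a Text file input step, and a cleanup job entry, with the job ID injected via a variable.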

Visualize Apache Log files

I want to quickly and efficiently analyze my Apache log files.
Is there software that will read in Apache log files and visually display, via a menu, statistics such as distinct IPs, request types, and so on, without my writing a parser?
Since you have the (odd) requirement of not writing a parser, you'll need to output your logs in a self-descriptive way (e.g. JSON). So, update your Apache config to write JSON, then use a shipper like Filebeat to send the logs to a store like Elasticsearch, where you can visualize them with a tool like Kibana.
The parser (Logstash, in ELK's case) is what lets you add more value to your log data, so I wouldn't dismiss it so quickly.
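Updating the Apache config to write JSON can look roughly like this (a sketch; the field selection is arbitrary, and the quoting/escaping is worth double-checking against your Apache version, since values like %r can contain quote characters):

```apache
# Emit each access-log entry as a single JSON object per line.
LogFormat "{ \"time\":\"%t\", \"client\":\"%a\", \"request\":\"%r\", \"status\":%>s, \"bytes\":%B, \"agent\":\"%{User-Agent}i\" }" json_combined
CustomLog ${APACHE_LOG_DIR}/access_json.log json_combined
```

With entries in this shape, Filebeat can ship them and Elasticsearch can index each field directly, with no parsing stage in between.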