Extract date time from Apache Combined log format using AWS Logs and CloudWatch

We're using awslogs to collect Apache Combined formatted logs into CloudWatch. The logs are being captured fine, but we're getting a "timestamp could not be parsed from message" error.
An example log entry:
::ffff:10.0.0.1 - blahblah [17/Aug/2017:20:31:07 +0000] "GET /favicon-16x16.png HTTP/1.1" 304 - "http://blahblah:3000/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"
Our config for this set of log files looks like this, including our datetime_format entry:
[access_logs]
log_group_name = cromwell
log_stream_name = react-172.31.43.245-access
file = /home/admin/aperian-react/log/*access.log
datetime_format = "%d/%b/%Y:%H%M:%S %z"
multi_line_start_pattern = ::ffff:
time_zone = UTC
encoding = ascii
As you can see, the datetime is mid-line. This is different from most examples for syslogs, etc. We could change our log format, but we'd prefer not to since they flow into other systems as well.

Our datetime_format string was missing a colon between %H and %M. 😒 😢
datetime_format = "%d/%b/%Y:%H%M:%S %z" # wrong
datetime_format = "%d/%b/%Y:%H:%M:%S %z" # correct
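The awslogs documentation describes datetime_format in terms of Python strptime-style codes, so a format string can be checked locally before it is deployed. The sketch below is only an assumption about how one might test it, not the agent's own code: the regex that pulls out the bracketed timestamp and the helper name try_parse are hypothetical, and the user agent in the sample line is shortened.

```python
import re
from datetime import datetime, timezone

# Sample entry from the question (user agent shortened for readability).
LOG_LINE = (
    '::ffff:10.0.0.1 - blahblah [17/Aug/2017:20:31:07 +0000] '
    '"GET /favicon-16x16.png HTTP/1.1" 304 - "http://blahblah:3000/" "Mozilla/5.0"'
)


def try_parse(line, fmt):
    """Return the timestamp parsed from the bracketed field, or None on failure.

    Hypothetical helper: the bracket regex is an assumption for this sketch.
    Note that %b depends on the locale; an English/C locale is assumed.
    """
    match = re.search(r"\[([^\]]+)\]", line)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), fmt)
    except ValueError:
        return None


wrong = try_parse(LOG_LINE, "%d/%b/%Y:%H%M:%S %z")     # missing colon
correct = try_parse(LOG_LINE, "%d/%b/%Y:%H:%M:%S %z")  # fixed format

print("wrong format  ->", wrong)
print("correct format ->", correct)
```

Running a check like this against one real log line makes a missing separator show up immediately, instead of only as an agent-side parse error.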

Related

Apache Log grok pattern

Can anybody please help with the grok pattern for below example of logs?
85.85.85.85 webmail.company.com "CN=First Last/O=Company/C=CZ" [14/Dec/2020:05:58:18 +0100] "GET /mail/User.nsf/iNotes/Proxy/?OpenDocument&Form=s_ReadViewEntries&PresetFields=DBQuotaInfo;1,FolderName;($Inbox),UnreadCountInfo;1,SearchSort;DateD,s_UsingHttps;1,noPI;1&TZType=UTC&Start=1&Count=23&resortdescending=6 HTTP/1.1" 200 2054 "https://webmail.company.com/mail/User.nsf/iNotes/Proxy/?OpenDocument&Form=l_ScriptFrame&l=en&gz&CR&MX&TSF=20170318T181650,92Z&TSX=20180206T185427,18Z&EFF=%2FiNotes%2FForms9_x&charset=UTF-8&charset=UTF-8&KIC&ua=safari&pt" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" 125 INOTES_LOGIN_ID=First%20Last; Shimmer=SI_TLM:20210209T072811%2C40Z&ST_Counter:3&LAO:mail&SAB:1&CS_TLM:20210209T072831%2C15Z&V_TLM:20210210T080147%2C82Z&DMS:5&ui:X&MOTLM:20210129T113159%2C00Z&DBQS:1503571%2C%207168000%2C%206963200%2C%200%2C%201503571&SPRKL:1&KOSCZ:GTB&FISD:1; INOTES_LOGIN_ID=First%20Last; DWAShared=0; DWAMode=0; INOTES_LOGIN_ID=First%20Last; DWAShared=0; DWAMode=0; LtpaToken2=FpoGJJz33bYLI+CtWy6OlIgoTJouNGEiduvxvQbcN8HRI7K6LThCsb1Dl8CzN72Zi05RGOUmQRMiOQcTk1norKHi6SbkEGI6GlXzjSIweBRSc8c+XPyAwA44PKPbu3WzrPfR0+uoC0sgTPvochvQ/VfPL/sSaqUFoRswRwyI+UeaOwTs/DvKiWLCpiKrVkFk3SmDjrxPBHb/WiL5nDkpp8Dsjjxnlo4vpx7BdOoVNai1jybvHkW28KXxkb21o8SSpmU7ZFdHyZFjDWCYuuCVOx7asV/q4a3lWdxlPfWdPcUguHML+xDmsrMPm6fTUSKeKIKdQEPr6VDmitBi7Z5URIlkRrUyslkTcc28y6fQir3Y20Hc9TmOvwaBlG/ehnpv; LtpaToken=0x4JJ4oWKojdqoz08Ng+MRUkkJq2vYGLGN9lp8HL8FxbD+xnivE7qzCzf92Q6x5OAPOBFRNgxd3Qg225zLwnJFWO0lGeIweH8VDgyWOMImNe6E9z9HBnQAN43vQ2uwtpv3X5E5DN0oLIPKLxAkqsHUDJqJ0SE6NZ6UnfLoR82JyjZVC/s6QEov5DNdpAY/o2Gxh0vWmE+wuQGuCh4mVCIP9KU/dbX4F0Ld9JEExzIpkdzKELibU2Akov0Krv0eWADSV++m/5ECLpaf6N6/VzkZEkt5XoOoL6OD/6ni4zojvo3O+X9Bn7Mdk2MnsQ1AccIohj5eN8Oi81QbD0a9b7jw==; ShimmerS=ET:20210210T114045%2c00Z&R:0&AT:M" "D:/Lotus/Domino/Data/mail/User.nsf"
What I would need is Client IP (85.85.85.85), VirtualHostname (webmail / webmail.company.com) , User (part after CN=, First Last), Time (14/Dec/2020:05:58:18), URL (GET /mail/User.nsf/iNotes/Proxy/?OpenDocument&Form=s_ReadViewEntries&PresetFields=DBQuotaInfo;1,FolderName; ... ) and the Device Info ( "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" )
I know it should start with the pattern below, but I can't work out how to capture the user name before continuing with [%{HTTPDATE:timestamp}]. The next part would probably be "(?:%{WORD:verb} %{NOTSPACE:request}, and I'm not sure how to capture the device info.
Any help would be appreciated!
%{IPORHOST:clientip} %{WORD:VirtualHost} ???
Since you have customized your log format, you have to build your own grok to match the log. You can use https://grokdebug.herokuapp.com/ to debug the pattern you're going to use and you can copy some patterns from https://github.com/logstash-plugins/logstash-patterns-core/tree/master/patterns
Solved:
%{IPORHOST:clientip} %{IPORHOST:destination.domain} "CN=%{DATA:username}" \[%{HTTPDATE:apache.access.time}\] "(?:%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}|-)?" %{NUMBER:http.response.status_code:long} (?:%{NUMBER:http.response.body.bytes:long}|-) ("%{DATA:http.request.referrer}") ("%{DATA:user_agent.original}")
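To sanity-check the field boundaries outside of grok, the same structure can be approximated with a plain regular expression. This is a rough Python sketch under assumptions, not a translation of the grok pattern: the group names are hypothetical, the sample line is a shortened version of the log entry in the question, and the user name is cut at the first "/" so that only the part after CN= ("First Last") is captured.

```python
import re

# Hypothetical regex mirroring the custom log layout from the question.
LINE_RE = re.compile(
    r'(?P<clientip>\S+) (?P<vhost>\S+) '
    r'"CN=(?P<username>[^/"]+)[^"]*" '          # stop the user name at the first "/"
    r'\[(?P<time>[^\]]+)\] '                    # literal brackets must be escaped
    r'"(?P<method>[A-Z]+) (?P<url>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Shortened sample; the real entry carries a much longer URL and cookie tail.
SAMPLE = (
    '85.85.85.85 webmail.company.com "CN=First Last/O=Company/C=CZ" '
    '[14/Dec/2020:05:58:18 +0100] '
    '"GET /mail/User.nsf/iNotes/Proxy/?OpenDocument HTTP/1.1" 200 2054 '
    '"https://webmail.company.com/mail/User.nsf" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)" 125'
)

match = LINE_RE.match(SAMPLE)  # prefix match: trailing fields are ignored
fields = match.groupdict() if match else {}

for name, value in fields.items():
    print(f"{name}: {value}")
```

The trailing fields (response time, cookies, file path) are deliberately left unmatched, which corresponds to a grok pattern that is not anchored at the end of the line.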

Amazon detects scrapy instantly. How to prevent captcha?

I am trying to scrape a single web page from Amazon with Scrapy 2.4.1 in the Scrapy shell. Without any prior scraping, Amazon instantly asks for a captcha.
My only countermeasure so far is setting a different user agent, and I have never scraped the page before:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
Get one page:
>>> fetch('https://www.amazon.de/Eastpak-Provider-Rucksack-Noir-Black/dp/B0815FZ3C6/')
>>> view(response)
Results in a captcha question.
I also tried it with headers:
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
>>> req = Request("https://www.amazon.de/Eastpak-Provider-Rucksack-Noir-Black/dp/B0815FZ3C6/", headers=headers)
>>> fetch(req)
This also results in a captcha, although the main page can be scraped this way.
How does Amazon detect that this is a bot, and how can I prevent that?

I want to exclude some line in the logs read by filebeat and also want to add a tag by using processors in filebeat but it is not working

I want to remove the log lines containing the word "HealthChecker" from the log below and also add some tags to the payload to be sent to Logstash.
My logs:
18.37.33.73 - - [18/Apr/2019:14:49:53 +0530] "GET /products?sort=date&direction=desc HTTP/1.1" 200 8543 "https://codingexplained.com/products/view/124" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_0 like Mac OS X) AppleWebKit/602.1.38 (KHTML, like Gecko) Version/10.0 Mobile/14A300 Safari/602.1"
20.4.2.88 - - [18/Apr/2019:14:49:54 +0530] "GET / HTTP/1.1" 200 100332 "-" "ELB-HealthChecker/2.0"
18.37.33.73 - - [18/Apr/2019:14:49:55 +0530] "GET /products?sort=date&direction=desc HTTP/1.1" 200 8543 "https://codingexplained.com/products/view/124" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_0 like Mac OS X) AppleWebKit/602.1.38 (KHTML, like Gecko) Version/10.0 Mobile/14A300 Safari/602.1"
20.4.2.88 - - [18/Apr/2019:14:49:56 +0530] "GET / HTTP/1.1" 200 100332 "-" "ELB-HealthChecker/2.0"
I have already tried putting this configuration in the processors section of the filebeat.yml file, but it still does not work.
My filebeat.yml file:
filebeat.modules:
- module: apache
  access:
    enabled: true
    # Set custom paths for the log files. If left empty,
    # Filebeat will choose the paths depending on your OS.
    var.paths: ["/location/apache_access_2017-09-28.log"]
    # Input configuration (advanced). Any input configuration option
    # can be added under this section.
    processors:
    - add_tags:
        tags: [web, production]
        target: "environment"
    - drop_event:
        when:
          contains:
            message: "ELB-HealthChecker"
filebeat.inputs:
# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.
- type: log
  # Change to true to enable this input configuration.
  enabled: false
output.console:
  # Boolean flag to enable or disable the output module.
  enabled: true
  codec.json:
    pretty: true
YAML indentation is to blame in your case. "processors" is a top-level element, so this would work:
filebeat.modules:
- module: apache
  access:
    enabled: true
    # Set custom paths for the log files. If left empty,
    # Filebeat will choose the paths depending on your OS.
    var.paths: ["/location/apache_access_2017-09-28.log"]
    # Input configuration (advanced). Any input configuration option
    # can be added under this section.
processors:
- add_tags:
    tags: [web, production]
    target: "environment"
- drop_event:
    when:
      contains:
        message: "ELB-HealthChecker"
When in doubt about indentation, refer to the filebeat.full.yml file.
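To make the intended effect of the two processors concrete, here is a small Python simulation. It is only an illustration of the expected result, not Filebeat code: the function name apply_processors and the event dictionaries are hypothetical, and the log lines are shortened versions of the ones in the question.

```python
# Shortened versions of the sample log lines, wrapped as event dicts.
EVENTS = [
    {"message": '18.37.33.73 - - [18/Apr/2019:14:49:53 +0530] "GET /products HTTP/1.1" 200 8543 "-" "Mozilla/5.0"'},
    {"message": '20.4.2.88 - - [18/Apr/2019:14:49:54 +0530] "GET / HTTP/1.1" 200 100332 "-" "ELB-HealthChecker/2.0"'},
    {"message": '18.37.33.73 - - [18/Apr/2019:14:49:55 +0530] "GET /products HTTP/1.1" 200 8543 "-" "Mozilla/5.0"'},
    {"message": '20.4.2.88 - - [18/Apr/2019:14:49:56 +0530] "GET / HTTP/1.1" 200 100332 "-" "ELB-HealthChecker/2.0"'},
]


def apply_processors(events):
    """Mimic add_tags (target: environment) followed by drop_event (contains)."""
    kept = []
    for event in events:
        tagged = dict(event)
        # add_tags with target "environment" writes the tags into that field.
        tagged["environment"] = ["web", "production"]
        # drop_event: discard the event when the message contains the substring.
        if "ELB-HealthChecker" in tagged["message"]:
            continue
        kept.append(tagged)
    return kept


kept = apply_processors(EVENTS)
for event in kept:
    print(event["environment"], event["message"][:40])
```

With the processors at the top level, the console output should likewise contain only the two non-health-check requests, each carrying the environment tags.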

Exporting format for csv in hive

I am running Hive 1.2. I have seen that to export a table I can use
insert overwrite directory '/user/table'
row format delimited
fields terminated by ','
LINES TERMINATED BY '\n'
select * from mytable.table;
That works, but I see that the result is formatted slightly differently from what I get when I download through the GUI in HUE.
I am downloading useragent fields in weblogs and the difference is here:
Downloading through the hive query and hadoop fs I get
["Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36"]
Downloading through HUE I get
"[""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36""]"
I prefer the format with the double quotes. What is this format called, and how can I change my Hive query to get the quotes in this format?
thanks!
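For reference, the HUE output appears to follow the usual CSV quoting convention: a field that contains commas or quotes is wrapped in double quotes, and embedded double quotes are doubled. The minimal sketch below uses Python's csv module purely to illustrate that convention; it is not a Hive-side fix, and the variable names are hypothetical.

```python
import csv
import io

# The raw field as it comes out of the plain Hive export.
value = (
    '["Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36"]'
)

# Write the field with the csv module's default (minimal) quoting.
buffer = io.StringIO()
csv.writer(buffer).writerow([value])
quoted = buffer.getvalue().rstrip("\r\n")
print(quoted)

# Reading the quoted form back recovers the original field unchanged.
round_trip = next(csv.reader(io.StringIO(quoted)))[0]
print(round_trip == value)
```

Because the user agent itself contains a comma, the unquoted Hive export splits into extra columns in a CSV reader, while the quoted form stays a single field.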

Apache logs showing strange ^# characters ? What does this mean ?

My apache logs are always interrupted by strange characters :
84.196.205.238, 172.23.20.177, 172.23.20.177 - - [05/May/2015:11:48:15 +0200] 0 www.sudinfo.be "GET /sites/default/files/imagecache/pagallery_450x300/552495393_google_street_view HTTP/1.1" 200 32620 "http://www.sudinfo.be/247263/article/culture/medias/2011-11-23/google-street-view-en%C2%A0belgique-comment-trouver-votre-maison" "Mozilla/5.0 (Linux; U; Android 4.2.2; nl-be; GT-P3110 Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30"
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#efault/files/imagecache/pagallery_450x300/2015/01/13/1554554859_B974505865Z.1_20150113094316_000_GVR3PDRHQ.1-0.jpg HTTP/1.1" 200 26033 "http://www.bing.com/images/search?q=leonardo+dicaprio+Met+gala&id=06B1C7410D6458C6A698AC09F3F8C6B7915BFFDE&FORM=IQFRBA" "Mozilla/5.0 (iPad; CPU OS 7_1_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D201 Safari/9537.53"
Do you have any idea what can be the cause of this ?
If your web server is externally accessible, this is probably an artifact of an attempt to hack your server.
As far as I recall, ^# is how Apache logs a NULL (zero) byte. These are used to pad attacks such as buffer overflows.
You may want to look at countermeasures such as mod_security:
https://github.com/SpiderLabs/ModSecurity/wiki/ModSecurity-Frequently-Asked-Questions-%28FAQ%29
I hope it is obvious that a fully patched server and application stack is more likely to withstand random attack attempts like this.
OK, I finally found out what the problem was. My log files are written to a network filesystem, and my bash client simply had trouble reading them over the network.
False alarm, everything is still safe. Thanks for the help.