Awstats - LogFormat doesn't match the Amazon S3 log file contents - amazon-s3

I'm trying to set up AWStats to parse Amazon S3 log files, but it keeps saying the log doesn't match the LogFormat. Below is the configuration and log content:
LogFormat="%other %extra1 %time1 %host %logname %other %method %url %otherquot %code %extra2 %bytesd %other %extra3 %extra4 %refererquot %uaquot %other"
0dfbd34f831f30a30832ff62edcb8a93158c056f27cebd6b746e35309d19039c looxcie-data1 [18/Dec/2011:04:30:15 +0000] 75.101.241.228 arn:aws:iam::062105025988:user/s3-user E938CC6E4B848BEA REST.GET.BUCKET - "GET /?delimiter=/&prefix=data/prod/looxciemp4/0/20/&max-keys=1000 HTTP/1.1" 200 - 672 - 44 41 "-" "-" -
Then I execute the command and get the following result:
root@test:/usr/local/awstats/wwwroot/cgi-bin# perl awstats.pl -update -config=www.awstats.apache.com
Create/Update database for config "/etc/awstats/awstats.www.awstats.apache.com.conf" by AWStats version 7.0 (build 1.971)
From data in log file "/var/log/httpd/access.log"...
Phase 1 : First bypass old records, searching new record...
Searching new records from beginning of log file...
Jumped lines in file: 0
Parsed lines in file: 1
Found 0 dropped records,
Found 0 comments,
Found 0 blank records,
Found 1 corrupted records,
Found 0 old records,
Found 0 new qualified records.
Can anyone help me figure it out?
===========================================
I found that the format "%logname" cannot match a name such as
arn:aws:iam::062105025988:user/s3-user
It is weird, but "%lognamequot" is able to match "arn:aws:iam::062105025988:user/s3-user".
This is the cause of the problem, but our system's log files do include lognames like arn:aws:iam::062105025988:user/s3-user.
Can anyone help me figure out why "%logname" doesn't match it?
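For reference, a minimal sketch of the adjusted format, assuming the rest of the original LogFormat stays the same and only %logname is swapped for %lognamequot (which, per the finding above, does match the IAM ARN):
LogFormat="%other %extra1 %time1 %host %lognamequot %other %method %url %otherquot %code %extra2 %bytesd %other %extra3 %extra4 %refererquot %uaquot %other"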

Related

New Windows installation, CLI not picking up .bigqueryrc

I had a Windows 10 machine where bq, gcloud, and gsutil were all working great. I had to move to a new machine and am setting all of this back up. I cannot seem to get bq to recognize the entries in my .bigqueryrc file, which was copied from my working machine. On my working machine I am not using the BIGQUERYRC environment variable, just HOME, i.e. ~/.bigqueryrc.
The reason I say it does not seem to recognize the file is that if I introduce a random character into it, I get an error about my .bigqueryrc file.
Let me first show that bq seems to be operating, overall, as expected:
C:\Users\boyer>bq ls --project_id=broad-tapestry-sbx-boyer-14
datasetId
---------------
Broad_DataGov
Broad_EDW
Broad_Kitchen
Broad_Lake
Broad_Tableau
Broad_Utils
SHELL> bq ls, however, brings back nothing. The contents of my .bigqueryrc file are as follows:
credential_file = C:\Users\boyer\AppData\Roaming\gcloud\legacy_credentials\boyer#broadinstitute.org\singlestore_bq.json
project_id = broad-tapestry-sbx-boyer-14
[query]
--use_legacy_sql=false
Here is where my .bigqueryrc file is:
C:\Users\boyer>ls -ls ~/.bigqueryrc
1 -rwx------+ 1 CHARLES+boyer CHARLES+boyer 198 Dec 22 12:48 /cygdrive/c/Users/boyer/.bigqueryrc
So, I try this and nothing happens either:
C:\Users\boyer>bq ls --bigqueryrc=c:/users/boyer/.bigqueryrc
C:\Users\boyer>bq ls --bigqueryrc=c:\users\boyer\.bigqueryrc
C:\Users\boyer>
But here is the error I purposely introduced to check whether it is actually looking at the file, and it seems it is finding it:
C:\Users\boyer>bq ls --bigqueryrc=c:\users\boyer\.bigqueryrc
Unknown flag x found in bigqueryrc file in section global
The error above was introduced by adding a random 'x' character to the top of the file:
x
credential_file = C:\Users\boyer\AppData\Roaming\gcloud\legacy_credentials\boyer#broadinstitute.org\singlestore_bq.json
project_id = broad-tapestry-sbx-boyer-14
[query]
--use_legacy_sql=false
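One hedged thing to try, assuming bq honors the BIGQUERYRC environment variable mentioned above (this is a suggestion, not something from the original setup): point it at the file explicitly for the current session and re-run the listing:
C:\Users\boyer>set BIGQUERYRC=C:\Users\boyer\.bigqueryrc
C:\Users\boyer>bq ls
Note that set is session-only; setx BIGQUERYRC "C:\Users\boyer\.bigqueryrc" would persist it for new shells.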

Amazon CloudWatch log agent ignoring first character of log lines

If I add a batch of n test records to my log file, the awslogs agent is erroneously ignoring the first character of the first line of the batch of rows that I add. This is fully reproducible. So if I use vi to append the following test log lines:
2017-06-16 15:20:01,123 this line will not get correctly parsed. first character in date string will be skipped.
2017-06-16 15:21:23,456 this will get parsed. 1
2017-06-16 15:22:23,456 this will get parsed. 2
2017-06-16 15:23:23,456 this will get parsed. 3
2017-06-16 15:24:23,456 this will get parsed. 4
2017-06-16 15:25:23,456 this will get parsed. 5
2017-06-16 15:26:23,456 this will get parsed. 6
2017-06-16 15:27:23,456 this will get parsed. 7
The leading 2 in the first row gets omitted by the log agent. In the CloudWatch Logs web console, the event shows up as 017-06-16 15:20:01,123 this line will..., the datetime string does not get successfully parsed, and the log event must use the timestamp of the previous log.
In the common scenario where I add log events to the file one at a time, the first letter of each line is ignored and the timestamp strings do not get correctly parsed. If I append multiple lines in vi before saving with :w, only the first line experiences this problem and the other lines in the batch get ingested correctly.
I created the log file (as a test) with touch and have only added lines manually with vi so I don't think this is a file encoding problem.
I'm using a mostly standard default configuration.
My CloudWatch Agent Config File:
[general]
state_file = /var/awslogs/state/agent-state
[/var/log/myapp/app.log]
file = /var/log/myapp/app.log
log_group_name = MyAppLogGroup
log_stream_name = MyAppLogStream
datetime_format = %Y-%m-%d %H:%M:%S,%f
Then I download the latest setup script from https://s3.amazonaws.com//aws-cloudwatch/downloads/latest/awslogs-agent-setup.py
And run sudo ./awslogs-agent-setup.py -n -r us-west-2 -c cloudwatch_logs.config
Try setting the
initial_position = start_of_file
option in your config file explicitly; do you get the same behavior?
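In context, a minimal sketch of the poster's stream section with that option added (everything else left exactly as in the config above; the placement is an assumption, not a verified fix):
[/var/log/myapp/app.log]
file = /var/log/myapp/app.log
log_group_name = MyAppLogGroup
log_stream_name = MyAppLogStream
datetime_format = %Y-%m-%d %H:%M:%S,%f
initial_position = start_of_file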

Uploading job fails on the same file that was uploaded successfully before

I'm running a regular upload job to load a CSV into BigQuery. The job runs every hour. According to the recent failure log, it says:
Error: [REASON] invalid [MESSAGE] Invalid argument: service.geotab.com [LOCATION] File: 0 / Offset:268436098 / Line:218637 / Field:2
Error: [REASON] invalid [MESSAGE] Too many errors encountered. Limit is: 0. [LOCATION]
I went to line 218638 (the original CSV has a header line, so I assume 218638 should be the actual failed line; let me know if I'm wrong) but it seems all right. I checked the corresponding table in BigQuery, and it has that line too, which means I actually uploaded this line successfully before.
Then why does it cause a failure now?
project id: red-road-574
Job ID: Job_Upload-7EDCB180-2A2E-492B-9143-BEFFB36E5BB5
This indicates that there was a problem with the data in your file, where it didn't match the schema.
The error message says it occurred at File: 0 / Offset:268436098 / Line:218637 / Field:2. This means the first file (it looks like you just had one), and then the chunk of the file starting at 268436098 bytes from the beginning of the file, then the 218637th line from that file offset.
The reason for the offset portion is that BigQuery processes large files in parallel with multiple workers. Each file worker starts at an offset from the beginning of the file, and the offset we include in the error is the offset that the worker started from.
From the rest of the error message, it looks like the string service.geotab.com showed up in the second field, but the second field was a number, and service.geotab.com isn't a valid number. Perhaps there was a stray newline?
You can see what the lines looked like around the error by doing:
cat <yourfile> | tail -c +268436098 | tail -n +218636 | head -3
This will print out three lines... the one before the error (since I used -n +218636 instead of +218637), the one that had the error, and the next line as well.
Note that if this is just one line in the file that has a problem, you may be able to work around the issue by specifying maxBadRecords.
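If you go that route, a hedged sketch of the equivalent bq CLI call; the dataset, table, and file names are placeholders rather than the job above, and --skip_leading_rows=1 assumes the single header line mentioned earlier:
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=1 mydataset.mytable ./myfile.csv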

Issue when loading data from cloud storage, at least an error message improvement is needed

When I try to load multiple files from Cloud Storage, larger jobs almost always fail. Loading an individual file works, but loading batches is really much more convenient.
Snippet:
Recent Jobs
Load 11:24am
gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0002.log.gz to albertbigquery:uep.201409
Load 11:23am
gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0001.log.gz to albertbigquery:uep.201409
Load 11:22am
gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409
Errors:
File: 40 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
File: 40 / Line:2 / Field:1, Bad character (ASCII 0) encountered: field starts with: <5C���>}�>
File: 40 / Line:3 / Field:1, Bad character (ASCII 0) encountered: field starts with: <����W�o�>
File: 40 / Line:4, Too few columns: expected 7 column(s) but got 2 column(s). For additional help:
File: 40 / Line:5, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:6, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:7, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 40 / Line:8 / Field:1, Bad character (ASCII 0) encountered: field starts with: <��hy�>
The worst part of this problem is that I don't know which file "File: 40" is; the order seems random. Otherwise I could remove that file and load the data, or try to find the error in the file.
I also strongly doubt that there even is an actual file error. For example, in the case above, when I removed all files except _0001 and _0002 (which loaded fine as single files) I still get this output:
Recent Jobs
Load 11:44am
gs://albertbigquery.appspot.com/uep/201409/01/* to albertbigquery:uep.201409
Errors:
File: 1 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
File: 1 / Line:2 / Field:3, Bad character (ASCII 0) encountered: field starts with:
File: 1 / Line:3, Too few columns: expected 7 column(s) but got 1 column(s). For additional help:
File: 1 / Line:4 / Field:3, Bad character (ASCII 0) encountered: field starts with:
Sometimes, though, the files load just fine; otherwise I'd expect multiple-file loading to be completely broken.
Info:
Average file size is around 20 MB; usually a directory is 70 files, somewhere between 1 and 2 GB.
It looks like you're hitting a BigQuery bug.
When BigQuery gets a load job request with a wildcard pattern (i.e. gs://foo/bar*) we first expand the pattern to the list of files. Then we read the first one to determine the compression type.
One oddity with GCS is that there isn't a real concept of a directory. That is, gs://foo/bar/baz.csv is really bucket 'foo', object 'bar/baz.csv'. It looks like you have empty files as placeholders for your directories (as in gs://albertbigquery.appspot.com/uep/201409/01/).
This empty file doesn't play nicely with BigQuery's probe-for-compression logic: when we expand the file pattern, the directory dummy file is the first thing that gets returned. We then open the dummy file, and since it doesn't appear to be a gzip file, we assume the compression type of the entire load is uncompressed.
We've filed a bug and have a fix under testing; hopefully the fix will be out next week. In the meantime, your options are to expand the pattern yourself, to use a longer pattern that won't match the directory (as in gs://albertbigquery.appspot.com/uep/201409/01/wpc*), or to delete the dummy directory file.
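A hedged sketch of two of those options: a gsutil listing to spot the zero-byte placeholder before deleting it, and a load using the narrower pattern. The --source_format=CSV flag and the omitted schema (assuming the destination table uep.201409 already exists) are assumptions, not details from the job above:
gsutil ls -l gs://albertbigquery.appspot.com/uep/201409/01/
bq load --source_format=CSV albertbigquery:uep.201409 "gs://albertbigquery.appspot.com/uep/201409/01/wpc*"
The zero-byte entry in the listing is the dummy directory object; the wpc* pattern simply never matches it.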

APC OpCode - apc.filters - Excluding a directory from cache

I have recently implemented APC on my KnownHost SSD VPS and it is working very well (WordPress sites).
One thing I would like to do is exclude a directory (or directories) from the cache.
I have read the documentation on apc.filters, and it is unclear to me whether directories or just file types can be excluded. I have also done extensive searching on the web and have not found a working example of excluding a directory.
I have tried numerous variations for apc.filters, and have yet to find one that will exclude my directory.
So if my directory is located on this path in the server:
/home/my_user/public_html/my_directory
What would the correct value be for apc.filters to exclude the "my_directory" sub-directory?
Moved from the question
UPDATE: I found the answer (with help from KnownHost)
The correct syntax to exclude one directory is:
apc.filters = "-my_directory/.*";
Multiple directories are:
apc.filters = "-my_directory/.*,-my_directory2/.*";
I hope this can help someone out there, as I could not get it right, or find any information on it.
Thanks
In the interest of complete information, here are my APC runtime settings:
apc.cache_by_default 1
apc.canonicalize 1
apc.coredump_unmap 0
apc.enable_cli 0
apc.enabled 1
apc.file_md5 0
apc.file_update_protection 2
apc.filters
apc.gc_ttl 3600
apc.include_once_override 0
apc.lazy_classes 0
apc.lazy_functions 0
apc.max_file_size 2M
apc.mmap_file_mask /tmp/apc.XXXXXX
apc.num_files_hint 3000
apc.preload_path
apc.report_autofilter 0
apc.rfc1867 0
apc.rfc1867_freq 0
apc.rfc1867_name APC_UPLOAD_PROGRESS
apc.rfc1867_prefix upload_
apc.rfc1867_ttl 3600
apc.serializer default
apc.shm_segments 1
apc.shm_size 512M
apc.slam_defense 0
apc.stat 1
apc.stat_ctime 0
apc.ttl 7200
apc.use_request_time 1
apc.user_entries_hint 4096
apc.user_ttl 7200
apc.write_lock 1
Comments from other users suggest that it may also be necessary to set apc.cache_by_default=0.
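Putting the poster's update and that comment together, a minimal php.ini sketch (hedged; whether apc.cache_by_default=0 is actually required reportedly varies by setup):
; exclude two sub-directories from the APC opcode cache
apc.filters = "-my_directory/.*,-my_directory2/.*"
; some users report the filter only takes effect with caching off by default
apc.cache_by_default = 0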