Removing liveness logs from FluentBit in EKS

I'm trying to stop shipping liveness logs to AWS CloudWatch to reduce charges from excessive logging. The grep filter doesn't seem to have any impact. What am I missing? Here is my configuration:
[SERVICE]
    Parsers_File /fluent-bit/parsers/parsers.conf

[INPUT]
    Name tail
    Tag kube.*
    Path /var/log/containers/*.log
    DB /var/log/flb_kube.db
    Parser docker
    Docker_Mode On
    Docker_Mode_Flush On
    Docker_Mode_Parser cwagent_firstline
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On
    Refresh_Interval 10

[FILTER]
    Name kubernetes
    Match kube.*
    Kube_URL https://kubernetes.default.svc.cluster.local:443
    Merge_Log On
    Merge_Log_Key data
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
    Labels Off
    Annotations Off

[FILTER]
    Name grep
    Match *
    Exclude log /*liveness*/

[OUTPUT]
    Name cloudwatch
    Match *
    region us-east-2
    log_group_name application
    log_stream_prefix fluentbit-
    log_retention_days 14
    auto_create_group true
I've tried making changes to the configuration, but the grep filter has no effect on the logs shipped to CloudWatch. Here is an example of a log record that should be excluded:
{"log":"{"level":"info","Remote Address":"::ffff:10.32.11.173 - -","Date":"2022-11-08T21:38:50.246Z","Method":"GET - /liveness - -","Status":"200","Response":"68 - 2.451 ms","Referrer":"- - kube-probe/1.21+"}\n","stream":"stdout","time":"2022-11-08T21:38:50.246629696Z"}
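For reference, this is the variant I am considering next. It is only a sketch: it assumes the grep Exclude pattern should be a plain regular expression without slash delimiters, and that the parsed fields end up nested under the data key because of Merge_Log_Key, so the record accessor path is my guess.
[FILTER]
    Name grep
    Match kube.*
    Exclude $data['Method'] liveness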

Related

Single file output from multiple source files with fluent-bit

We are using fluent-bit to capture multiple logs within a directory, do some basic parsing and filtering, and send the output to S3. Each source file seems to correspond to a separate output file in the bucket rather than a combined output.
Is it possible to send multiple input files to a single output file in fluent-bit, or is this simply how the buffer flush behavior works?
Here is our config for reference:
[SERVICE]
    Daemon Off
    Flush 1
    Log_Level warn
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    Health_Check Off
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    storage.path /tmp/fluentbit/
    storage.max_chunks_up 128

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    multiline.parser docker, cri
    Tag kube.*
    storage.type filesystem
    Mem_Buf_Limit 10MB
    Buffer_Chunk_Size 2M
    Buffer_Max_size 256M
    Skip_Long_Lines On
    Skip_Empty_Lines On

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    Merge_Log_Key msg-json
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
    Cache_Use_Docker_Id On

[FILTER]
    Name nest
    Match kube.*
    Operation lift
    Nested_under kubernetes
    Add_prefix kubernetes_

[FILTER]
    Name nest
    Match kube.*
    Operation lift
    Nested_under kubernetes_labels
    Add_prefix kubernetes_labels_

[FILTER]
    Name aws
    Match *
    imds_version v1
    az true
    ec2_instance_id true
    ec2_instance_type true
    private_ip true
    account_id true
    hostname true
    vpc_id true

[OUTPUT]
    Name s3
    Match *
    bucket <bucket name redacted>
    region us-east-1
    total_file_size 100M
    upload_timeout 60s
    use_put_object true
    compression gzip
    store_dir_limit_size 500m
    s3_key_format /fluentbit/team/%Y.%m.%d.%H_%M_%S.$UUID.gz
    static_file_path On
It is possible to send multiple input files to a single output file.
The issue here is likely your use of s3_key_format.
Your current file name format is '/fluentbit/team/%Y.%m.%d.%H_%M_%S.$UUID.gz', and the $UUID causes each input file to be written to a separate output file in S3.
To combine them and send everything to a single output file, modify it to '/fluentbit/team/%Y.%m.%d.gz'.
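A sketch of the adjusted output section, keeping everything except s3_key_format exactly as in your configuration:
[OUTPUT]
    Name s3
    Match *
    bucket <bucket name redacted>
    region us-east-1
    total_file_size 100M
    upload_timeout 60s
    use_put_object true
    compression gzip
    store_dir_limit_size 500m
    s3_key_format /fluentbit/team/%Y.%m.%d.gz
    static_file_path On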

Hitachi Content Platform (HCP) S3 - How do I disable or delete previous versions?

I am (unfortunately) using Hitachi Content Platform for S3 object storage, and I need to sync around 400 images to a bucket every 2 minutes. The filenames are always the same, and the sync "updates" the original file with the latest image.
Originally, I was unable to overwrite existing files. Unlike other platforms, on HCP you cannot update a file that already exists while versioning is disabled; it returns a 409 and won't store the file, so I've enabled versioning, which allows the files to be overwritten.
The issue now is that HCP is set to retain old versions for 0 days for my bucket (which my S3 admin says should cause it to retain no versions) and "Keep deleted versions" is also disabled, but the bucket is still filling up with objects (400 files every 2 minutes = ~288K objects per day). It seems to cap out at this amount; after the first day it stays at ~288K permanently, which suggests the old versions are eventually removed after 1 day.
Here's an example script that simulates the problem:
# Generate 400 files with the current date/time in them
for i in $(seq -w 1 400); do
    echo $(date +'%Y%m%d%H%M%S') > "file_${i}.txt"
done
# Sync the current directory to the bucket
aws --endpoint-url $HCP_HOST s3 sync . s3://$HCP_BUCKET/
# Run this a few times to simulate the 2 minute upload cycle
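A wrapper that simulates the real 2-minute cycle (my own sketch, just the commands above in a loop) looks like this:
# Regenerate the files and sync every 2 minutes
while true; do
    for i in $(seq -w 1 400); do
        echo $(date +'%Y%m%d%H%M%S') > "file_${i}.txt"
    done
    aws --endpoint-url $HCP_HOST s3 sync . s3://$HCP_BUCKET/
    sleep 120
done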
The initial sync is very quick and takes less than 5 seconds, but throughout the day it becomes slower and slower as the bucket accumulates more versions, sometimes eventually taking over 2 minutes to sync the files (which is bad, since I need to sync the files every 2 minutes).
If I try to list the objects in the bucket after 1 day, only 400 files come back in the list, but it can take over 1 minute to return (which is why I need to add --cli-read-timeout 0):
# List all the files in the bucket
aws --endpoint-url $HCP_HOST s3 ls s3://$HCP_BUCKET/ --cli-read-timeout 0 --summarize
# Output
Total Objects: 400
Total Size: 400
I can also list and see all of the old unwanted versions:
# List object versions and parse output with jq
aws --endpoint-url $HCP_HOST s3api list-object-versions --bucket $HCP_BUCKET --cli-read-timeout 0 | jq -c '.Versions[] | {"key": .Key, "version_id": .VersionId, "latest": .IsLatest}'
Output:
{"key":"file_001.txt","version_id":"107250810359745","latest":false}
{"key":"file_001.txt","version_id":"107250814851905","latest":false}
{"key":"file_001.txt","version_id":"107250827750849","latest":false}
{"key":"file_001.txt","version_id":"107250828383425","latest":false}
{"key":"file_001.txt","version_id":"107251210538305","latest":false}
{"key":"file_001.txt","version_id":"107251210707777","latest":false}
{"key":"file_001.txt","version_id":"107251210872641","latest":false}
{"key":"file_001.txt","version_id":"107251212449985","latest":false}
{"key":"file_001.txt","version_id":"107251212455681","latest":false}
{"key":"file_001.txt","version_id":"107251212464001","latest":false}
{"key":"file_001.txt","version_id":"107251212470209","latest":false}
{"key":"file_001.txt","version_id":"107251212644161","latest":false}
{"key":"file_001.txt","version_id":"107251212651329","latest":false}
{"key":"file_001.txt","version_id":"107251217133185","latest":false}
{"key":"file_001.txt","version_id":"107251217138817","latest":false}
{"key":"file_001.txt","version_id":"107251217145217","latest":false}
{"key":"file_001.txt","version_id":"107251217150913","latest":false}
{"key":"file_001.txt","version_id":"107251217156609","latest":false}
{"key":"file_001.txt","version_id":"107251217163649","latest":false}
{"key":"file_001.txt","version_id":"107251217331201","latest":false}
{"key":"file_001.txt","version_id":"107251217343617","latest":false}
{"key":"file_001.txt","version_id":"107251217413505","latest":false}
{"key":"file_001.txt","version_id":"107251217422913","latest":false}
{"key":"file_001.txt","version_id":"107251217428289","latest":false}
{"key":"file_001.txt","version_id":"107251217433537","latest":false}
{"key":"file_001.txt","version_id":"107251344110849","latest":true}
// ...
I thought I could just run a job that cleans up the old versions on a regular basis, but when I try to delete an old version it fails with an error:
# Try deleting an old version for the file_001.txt key
aws --endpoint-url $HCP_HOST s3api delete-object --bucket $HCP_BUCKET --key "file_001.txt" --version-id 107250810359745
# Error
An error occurred (NotImplemented) when calling the DeleteObject operation:
Only the current version of an object can be deleted.
I've tested this using MinIO and AWS S3 and my use-case works perfectly fine on both of those platforms.
Is there anything I'm doing incorrectly, or is there a setting in HCP that I'm missing that could make it so I can overwrite objects on sync while retaining no previous versions? Alternatively, is there a way to manually delete the previous versions?
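For what it's worth, here is how I check the bucket's versioning configuration through the standard S3 API (I'm assuming HCP answers this call the same way AWS does):
# Show the bucket's versioning configuration
aws --endpoint-url $HCP_HOST s3api get-bucket-versioning --bucket $HCP_BUCKET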

How to fix AWS S3 error: socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed

We have a long running Pentaho job in Rundeck that has recently started to experience failures.
The Rundeck step definition is:
echo "Using AMI $(aws ec2 describe-images --owners 660354881677 --filters 'Name=name,Values="ds-pentaho-v1.15.0"' --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text)"
/usr/local/bin/pentaho-cli run-ec2 --environment prd \
--ami-id $(aws ec2 describe-images --owners 660354881677 --filters 'Name=name,Values="ds-pentaho-v1.15.0"' --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text) \
--arg BOOL_ARG_1=true \
--instance-type "m5.2xlarge" \
--on-demand \
--max-fleet-retries "2" \
--emr-host "some_emr_host.net" \
--dir "/some/job/path/" \
--job "some_job_name"
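As an aside, the step runs the same describe-images lookup twice; a small rewrite of the step (same commands as above, just capturing the AMI ID once) would be:
AMI_ID=$(aws ec2 describe-images --owners 660354881677 --filters 'Name=name,Values="ds-pentaho-v1.15.0"' --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text)
echo "Using AMI $AMI_ID"
/usr/local/bin/pentaho-cli run-ec2 --environment prd \
    --ami-id "$AMI_ID" \
    --arg BOOL_ARG_1=true \
    --instance-type "m5.2xlarge" \
    --on-demand \
    --max-fleet-retries "2" \
    --emr-host "some_emr_host.net" \
    --dir "/some/job/path/" \
    --job "some_job_name"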
Specifically, the job has gone from consistently completing to not completing and just hangs in a long-running status. Digging through the logs, we find this:
2022/12/02 06:21:31 - tfo: task_read_table_2 - linenr 13900000
2022/12/02 06:21:11 - tfo: task_read_table_1 - linenr 900000 <- Last Line output for table 1 here
2022/12/02 06:21:59 - tfo: task_read_table_2 - linenr 15150000
2022/12/02 06:22:00 - tfo: task_read_table_2 - linenr 15200000
2022/12/02 06:22:01 - tfo: task_read_table_2 - linenr 15250000
com.amazonaws.services.s3.model.AmazonS3Exception: Your socket
connection to the server was not read from or written to within the
timeout period. Idle connections will be closed. (Service: Amazon S3;
Status Code: 400; Error Code: RequestTimeout;
2022/12/02 06:22:02 - tfo: task_read_table_2 - linenr 15300000
2022/12/02 06:22:03 - tfo: task_read_table_2 - linenr 15350000
2022/12/02 06:24:52 - in: task_read_table_2 - Finished reading query, closing connection.
2022/12/02 06:24:54 - tfo: task_read_table_3 - linenr 50000
...
2022/12/02 06:35:53 - tfo: task_read_table_3 - linenr 37500000
2022/12/02 06:35:54 Finished reading query, closing connection.
<The job hangs here!>
The AWS S3 exception above occurs anywhere between 10 minutes and 2 hours of the job running.
The job is on a parallel step that pulls data from multiple PostgreSQL tables and loads the data into S3 with a text output step. It seems like S3 hangs up after the read on task_read_table_1 - the data is not written to S3. The job continues to pull records from the other source tables until completion, but it never writes those table outputs to S3 either. From there, the job just hangs. The site engineers are not sure what is going on here. I think this may be an issue with how either AWS or Rundeck is set up. Note: we use Terraform to manage timeouts, and those are currently set to 24 hours, so the job should be well within the timeout period.
The number of records between successful and unsuccessful runs appears to be the same. There do not appear to be many recent, reliable search results on the internet - most are 5-10 years old and do not seem relevant.
I do not think this is a problem with the Pentaho job itself, because it has completed without failure in the past and the overall record counts of what is being pulled/loaded are stable.
Does anyone know what is potentially causing this issue or how it can be diagnosed?
Note: This is my first engagement working with AWS, Rundeck, Terraform, and Pentaho. I am more of an ETL developer than a site engineer. Any help is appreciated.

How to start certain number of nodes in Redis cluster

To create and start a cluster in Redis, I use the create-cluster.sh file inside
/redis-3.04/utils/create-cluster
With it I can create as many nodes as I want by changing the settings:
PORT=30000
TIMEOUT=2000
NODES=10
REPLICAS=1
I wonder if I can create, for example, 10 nodes (5 masters with 5 slaves) at the beginning, but start (meet and join) only 4 masters and 4 slaves.
Thanks in advance.
Yes. You can add more nodes later if the load on your existing cluster increases.
The basic steps are:
Start the new Redis instances - let's say you want to add 2 more masters and their slaves (4 new Redis instances in total).
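A minimal sketch of bringing those instances up in cluster mode (the ports and file names here are placeholders of mine, not from the question):
# Start four new instances on ports 7006-7009 with cluster mode enabled
for port in 7006 7007 7008 7009; do
    redis-server --port $port --cluster-enabled yes --cluster-config-file nodes-${port}.conf --appendonly yes --daemonize yes
done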
Then, using the redis-trib utility, add each new master:
redis-trib.rb add-node <new master node:port> <any existing master>
e.g. ./redis-trib.rb add-node 192.168.1.16:7000 192.168.1.15:7000
After this, the new node will be assigned an ID. Note that ID, and run the following command to add a slave to the node added in the previous step:
./redis-trib.rb add-node --slave --master-id <master-node-id> <new-node> <master-node>
./redis-trib.rb add-node --slave --master-id 6f9db976c3792e06f9cd252aec7cf262037bea4a 192.168.1.17:7000 192.168.1.16:7000
where 6f9db976c3792e06f9cd252aec7cf262037bea4a is the ID of 192.168.1.16:7000.
Using the same steps you can add one more master-slave pair.
Since these new nodes do not contain any hash slots to serve, you have to move some of the slots from the existing masters to the new masters (resharding).
To do that, run the following resharding steps:
6.1 ./redis-trib.rb reshard <any-master-ip>:<master-port>
6.2 It will ask: How many slots do you want to move (from 1 to 16384)? Enter the number of slots you want to move.
6.3 Then it will ask: What is the receiving node ID?
6.4 Enter the node ID to which the slots should be moved (one of the new masters).
6.5 It will prompt:
Please enter all the source node IDs.
Type 'all' to use all the nodes as source nodes for the hash slots.
Type 'done' once you entered all the source nodes IDs.
Source node #1: (enter a source node ID, or 'all')
6.6 It will then print the proposed slot moves, e.g.
Moving slot 10960 from 37d10f18f349a6a5682c791bff90e0188ae35e49
Moving slot 10961 from 37d10f18f349a6a5682c791bff90e0188ae35e49
Moving slot 10962 from 37d10f18f349a6a5682c791bff90e0188ae35e49
6.7 Finally it will ask: Do you want to proceed with the proposed reshard plan (yes/no)? Type yes, press enter, and you are done.
Note: if the data is large, resharding may take some time.
A few useful commands:
To list all nodes in the cluster along with their node IDs:
redis-cli -h node-ip -p node-port cluster nodes
e.g. redis-cli -h 127.0.0.1 -p 7000 cluster nodes
To list all slot assignments in the cluster:
redis-cli -h 127.0.0.1 -p 7000 cluster slots
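Another check that may help here (my addition, not part of the original answer) shows the overall cluster state and how many slots are assigned:
redis-cli -h 127.0.0.1 -p 7000 cluster info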
Ref : https://redis.io/commands/cluster-nodes
Hope this helps.

Hadoop jobs getting poor locality

I have some fairly simple Hadoop streaming jobs that look like this:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
-files hdfs:///apps/local/count.pl \
-input /foo/data/bz2 \
-output /user/me/myoutput \
-mapper "cut -f4,8 -d," \
-reducer count.pl \
-combiner count.pl
The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.
The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed).
When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but always only one node. Therefore I'm achieving poor data locality as data is transferred over the network to this node, and probably achieving poor CPU usage too.
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
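For reference, the same block placement can also be confirmed from the command line (same input path as in the job above):
hdfs fsck /foo/data/bz2 -files -blocks -locations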
I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files.
It looks like this previous thread [ why map task always running on a single node ] is relevant but not conclusive.
EDIT: at @jtravaglini's suggestion I tried the following variation and saw the same problem - all 45 map tasks running on a single node:
yarn jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar \
wordcount /foo/data/bz2 /user/me/myoutput
At the end of the output of that task in my shell, I see:
Launched map tasks=45
Launched reduce tasks=1
Data-local map tasks=18
Rack-local map tasks=27
which is the number of data-local tasks you'd expect to see on one node just by chance alone.