Single file output from multiple source files with fluent-bit - amazon-s3

We are using fluent-bit to capture multiple logs within a directory, do some basic parsing and filtering, and send the output to S3. Each source file seems to correspond to a separate output file in the bucket rather than a combined output.
Is it possible to send multiple input files to a single output file in fluent-bit, or is this simply how the buffer flush behavior works?
Here is our config for reference:
[SERVICE]
Daemon Off
Flush 1
Log_Level warn
Parsers_File parsers.conf
Parsers_File custom_parsers.conf
Health_Check Off
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /tmp/fluentbit/
storage.max_chunks_up 128
[INPUT]
Name tail
Path /var/log/containers/*.log
multiline.parser docker, cri
Tag kube.*
storage.type filesystem
Mem_Buf_Limit 10MB
Buffer_Chunk_Size 2M
Buffer_Max_size 256M
Skip_Long_Lines On
Skip_Empty_Lines On
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
Merge_Log_Key msg-json
K8S-Logging.Parser On
K8S-Logging.Exclude On
Cache_Use_Docker_Id On
[FILTER]
Name nest
Match kube.*
Operation lift
Nested_under kubernetes
Add_prefix kubernetes_
[FILTER]
Name nest
Match kube.*
Operation lift
Nested_under kubernetes_labels
Add_prefix kubernetes_labels_
[FILTER]
Name aws
Match *
imds_version v1
az true
ec2_instance_id true
ec2_instance_type true
private_ip true
account_id true
hostname true
vpc_id true
[OUTPUT]
Name s3
Match *
bucket <bucket name redacted>
region us-east-1
total_file_size 100M
upload_timeout 60s
use_put_object true
compression gzip
store_dir_limit_size 500m
s3_key_format /fluentbit/team/%Y.%m.%d.%H_%M_%S.$UUID.gz
static_file_path On

It is possible to send multiple input files to a single output file.
The issue here might be with your use of s3_key_format.
Your current file name format is '/fluentbit/team/%Y.%m.%d.%H_%M_%S.$UUID.gz', and this contains a $UUID, which causes each input file to be written to a separate output file in S3.
To combine them into a single output file, modify it to '/fluentbit/team/%Y.%m.%d.gz'.
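For illustration, a minimal sketch of the adjusted [OUTPUT] section, keeping everything else from the config above and changing only the key format (the bucket name remains a placeholder):
[OUTPUT]
Name s3
Match *
bucket <bucket name redacted>
region us-east-1
total_file_size 100M
upload_timeout 60s
use_put_object true
compression gzip
store_dir_limit_size 500m
s3_key_format /fluentbit/team/%Y.%m.%d.gz
static_file_path On
With no %H_%M_%S or $UUID left in the key, every upload within a given day resolves to the same object key.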

Related

Removing liveness logs from FluentBit in EKS

I am trying to stop shipping liveness logs to AWS CloudWatch to reduce the charges from excessive logging. The grep filter doesn't seem to have any impact. What am I missing?
[SERVICE]
Parsers_File /fluent-bit/parsers/parsers.conf
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
DB /var/log/flb_kube.db
Parser docker
Docker_Mode On
Docker_Mode_Flush On
Docker_Mode_Parser cwagent_firstline
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc.cluster.local:443
Merge_Log On
Merge_Log_Key data
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
Labels Off
Annotations Off
[FILTER]
Name grep
Match *
Exclude log /*liveness*/
[OUTPUT]
Name cloudwatch
Match *
region us-east-2
log_group_name application
log_stream_prefix fluentbit-
log_retention_days 14
auto_create_group true
I have tried making changes to the configuration, but the grep filter has no effect on the logs shipped to AWS. Here is a sample of the liveness log entries that should be excluded:
{"log":"{"level":"info","Remote Address":"::ffff:10.32.11.173 - -","Date":"2022-11-08T21:38:50.246Z","Method":"GET - /liveness - -","Status":"200","Response":"68 - 2.451 ms","Referrer":"- - kube-probe/1.21+"}\n","stream":"stdout","time":"2022-11-08T21:38:50.246629696Z"}

Snowflake - Azure File upload - How can i partition the file if size is more than 40MB

I have to upload the data from a Snowflake table to Azure Blob storage using the COPY INTO command. The copy command I have works with the SINGLE = TRUE property, but I want to break the output into multiple files if the size exceeds 40MB.
For example, there is a table 'TEST' in Snowflake with 100MB of data, and I want to upload this data to Azure Blob storage.
The COPY INTO command should create files in the format below:
TEST_1.csv (40MB)
TEST_2.csv (40MB)
TEST_3.csv (20MB)
--COPY INTO Command I am using
copy into @stage/test.csv from snowflake.test file_format = (format_name = PRW_CSV_FORMAT) header=true OVERWRITE = TRUE SINGLE = TRUE max_file_size = 40000000
We cannot control the exact output size of file unloads, only the maximum file size. The number and size of the files are chosen by Snowflake for maximum performance as it parallelizes the operation. If you want to control the number/size of files, that would be a feature request. Otherwise, just work out a process outside of Snowflake to combine the files afterward. For more details about unloading, please refer to the blog
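As a hedged sketch (stage, table, and file-format names are placeholders taken from the question): to let Snowflake split the unload itself, leave SINGLE at its default of FALSE and cap each file with MAX_FILE_SIZE, for example:
copy into @stage/test
  from snowflake.test
  file_format = (format_name = PRW_CSV_FORMAT)
  header = true
  OVERWRITE = TRUE
  SINGLE = FALSE            -- allow multiple output files
  max_file_size = 40000000; -- ~40 MB upper bound per file
Snowflake still decides how many files it actually writes and appends its own numeric suffixes (something like test_0_0_0.csv), which is the limitation described above.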

Jmeter non GUI mode csv report not showing latency

I am trying to get the JMeter HTML report for a file transfer over the SFTP protocol.
I am using the SSH SFTP Protocol plugin and have added a Simple Data Writer to that thread group.
I have created my own SFTP server using Apache MINA. The JMeter script hits the server I created and uploads the file.
Script Parameters:
Thread Group - 250
Ramp up period - 50
Loop Count - 1
I run the script in non-GUI mode as nohup sh jmeter.sh -n -t Singlepart_MultipleThread_RampUp.jmx -l Singlepart_MultipleThread_RampUp.jtl. I do get a CSV generated, which I convert into an HTML report with the command jmeter -g <csv> -o <destination_folder>.
The HTML report shows Latency vs Time and Latency vs Request as zero, and even the CSV report shows the latency column as zero.
Below is my user.properties file
user.properties
# Latencies Over Time graph definition
jmeter.reportgenerator.graph.latenciesOverTime.classname=org.apache.jmeter.report.processor.graph.impl.LatencyOverTimeGraphConsumer
jmeter.reportgenerator.graph.latenciesOverTime.title=Latencies Over Time
jmeter.reportgenerator.graph.latenciesOverTime.property.set_granularity=${jmeter.reportgenerator.overall_granularity}
# Latencies Vs Request graph definition
jmeter.reportgenerator.graph.latencyVsRequest.classname=org.apache.jmeter.report.processor.graph.impl.LatencyVSRequestGraphConsumer
jmeter.reportgenerator.graph.latencyVsRequest.title=Latencies Vs Request
jmeter.reportgenerator.graph.latencyVsRequest.exclude_controllers=true
jmeter.reportgenerator.graph.latencyVsRequest.property.set_granularity=${jmeter.reportgenerator.overall_granularity}
jmeter.properties
#---------------------------------------------------------------------------
# Results file configuration
#---------------------------------------------------------------------------
# This section helps determine how result data will be saved.
# The commented out values are the defaults.
# legitimate values: xml, csv, db. Only xml and csv are currently supported.
jmeter.save.saveservice.output_format=csv
# The below properties are true when field should be saved; false otherwise
#
# assertion_results_failure_message only affects CSV output
jmeter.save.saveservice.assertion_results_failure_message=true
#
# legitimate values: none, first, all
jmeter.save.saveservice.assertion_results=all
#
jmeter.save.saveservice.data_type=true
jmeter.save.saveservice.label=true
jmeter.save.saveservice.response_code=true
# response_data is not currently supported for CSV output
jmeter.save.saveservice.response_data=true
# Save ResponseData for failed samples
jmeter.save.saveservice.response_data.on_error=false
jmeter.save.saveservice.response_message=true
jmeter.save.saveservice.successful=true
jmeter.save.saveservice.thread_name=true
jmeter.save.saveservice.time=true
jmeter.save.saveservice.subresults=true
jmeter.save.saveservice.assertions=true
jmeter.save.saveservice.latency=true
# Only available with HttpClient4
#jmeter.save.saveservice.connect_time=true
jmeter.save.saveservice.samplerData=true
#jmeter.save.saveservice.responseHeaders=false
#jmeter.save.saveservice.requestHeaders=false
#jmeter.save.saveservice.encoding=false
jmeter.save.saveservice.bytes=true
# Only available with HttpClient4
jmeter.save.saveservice.sent_bytes=true
jmeter.save.saveservice.url=true
jmeter.save.saveservice.filename=false
jmeter.save.saveservice.hostname=false
jmeter.save.saveservice.thread_counts=true
jmeter.save.saveservice.sample_count=false
jmeter.save.saveservice.idle_time=true
# Timestamp format - this only affects CSV output files
# legitimate values: none, ms, or a format suitable for SimpleDateFormat
#jmeter.save.saveservice.timestamp_format=ms
#jmeter.save.saveservice.timestamp_format=yyyy/MM/dd HH:mm:ss.SSS
# For use with Comma-separated value (CSV) files or other formats
# where the fields' values are separated by specified delimiters.
# Default:
#jmeter.save.saveservice.default_delimiter=,
# For TAB, one can use:
#jmeter.save.saveservice.default_delimiter=\t
# Only applies to CSV format files:
# Print field names as first line in CSV
#jmeter.save.saveservice.print_field_names=true
# Optional list of JMeter variable names whose values are to be saved in the result data files.
# Use commas to separate the names. For example:
#sample_variables=SESSION_ID,REFERENCE
# N.B. The current implementation saves the values in XML as attributes,
# so the names must be valid XML names.
# By default JMeter sends the variable to all servers
# to ensure that the correct data is available at the client.
# Optional xml processing instruction for line 2 of the file:
# Example:
#jmeter.save.saveservice.xml_pi=<?xml-stylesheet type="text/xsl" href="../extras/jmeter-results-detail-report.xsl"?>
# Default value:
#jmeter.save.saveservice.xml_pi=
# Prefix used to identify filenames that are relative to the current base
#jmeter.save.saveservice.base_prefix=~/
# AutoFlush on each line written in XML or CSV output
# Setting this to true will result in less test results data loss in case of Crash
# but with impact on performances, particularly for intensive tests (low or no pauses)
# Since JMeter 2.10, this is false by default
#jmeter.save.saveservice.autoflush=false
So basically I am facing issues in two places:
How do I get the latency value?
When I provide a ramp-up value of 1, the script with Thread Group = 50 takes around 16 seconds to complete the upload, whereas if I give a ramp-up other than 1, such as 10, the script ends after exactly 10 seconds, irrespective of whether the file has been uploaded, and produces vague results in the HTML report as well.
Any idea how to solve this, or do I need to do anything else in the script?
You cannot, as the plugin you're using doesn't call the SampleResult.setLatency() function anywhere;
theoretically it should be possible to request the functionality from the plugin developers.
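For context, here is a minimal, hypothetical sketch (not the plugin's actual code) of how a JMeter sampler normally records latency on its SampleResult; the class name and the placement of the connection/upload steps are assumptions for illustration:
import org.apache.jmeter.samplers.SampleResult;

// Hypothetical sampler skeleton: latencyEnd() is what fills the latency column.
public class SftpSampleSketch {
    public SampleResult sample() {
        SampleResult result = new SampleResult();
        result.sampleStart();      // start the overall timer
        // ... open the SFTP connection / send the request here ...
        result.latencyEnd();       // latency = time from sampleStart() to first response
        // ... upload the file here ...
        result.sampleEnd();        // total elapsed time
        result.setSuccessful(true);
        return result;
    }
}
If a plugin never calls latencyEnd() (or setLatency()), the stored latency stays 0, which is exactly what ends up in the CSV and the HTML report.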
Setting a 10-second ramp-up period for 50 virtual users means that JMeter starts with 1 virtual user and gradually increases the load to 50 over 10 seconds. Make sure to have enough loops defined in the Thread Group, as you may run into the situation where the 1st user has already finished uploading the file and been terminated before the 2nd has even started, so you get a maximum concurrency of 1 user (this can be checked using the Active Threads Over Time listener). See JMeter Test Results: Why the Actual Users Number is Lower than Expected for a more detailed explanation if needed.

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x & Boto 2.x)
I see there is an API for Node.js (unzip-to-s3) to do this job, but none for Python.
A couple of implementations I can think of:
A simple API to extract the zip file within the same bucket.
Use s3 as a filesystem and manipulate data
Use a data pipeline to achieve this
Transfer the zip to EC2, extract, and copy back to S3.
Option 4 would be the least preferred, to minimise the architecture overhead of the EC2 add-on.
I need support in getting this feature implemented, with integration with Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in Advance,
Sundar.
You could try https://www.cloudzipinc.com/, which unzips/expands several different formats of archives from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
I have solved this by using an EC2 instance: copy the S3 files to a local directory on EC2, extract them, and copy that directory back to the S3 bucket.
Sample code to unzip to a local directory on the EC2 instance:
import os
import tarfile
import zipfile

def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 local file system
    Returns:
        None
    '''
    # s3Conn and logger_s3 are assumed to be defined elsewhere in the module
    # (a boto 2 S3 connection and a logger).
    s3 = s3Conn
    bucket = s3.lookup(srcBucket)
    # make sure the destination directory exists before downloading anything
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except Exception:
        logger_s3.warning("Exception in creating local directories to extract zip file / folder already existing")
    for key in bucket:
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)
        # pick the archive opener based on the file extension
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')
        cwd = os.getcwd()
        os.chdir(dst_dir)
        try:
            archive = opener(path, mode)
            try:
                archive.extractall()
            finally:
                archive.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        os.chdir(cwd)
    s3.close()
Sample code to upload to a MySQL instance
Use the "LOAD DATA LOCAL INFILE" query to upload to MySQL directly:
def upload(file_path, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_path (list): local file paths to load
        timeformat (string): str_to_date format of the datetime column
    Returns:
        None
    '''
    # connect() and logger_rds are assumed to be defined elsewhere in the module.
    for file in file_path:
        try:
            con = connect()
            cursor = con.cursor()
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (col1, col2, col3, @datetime, col4) set datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file:" + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # roll back in case there is any error
            con.rollback()
    cursor.close()
    # disconnect from server
    con.close()
Lambda function:
You can use a Lambda function where you read the zipped files into a buffer, gzip the individual files, and re-upload them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time there is a new zipped file in S3. Here's a full tutorial for exactly this: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
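A minimal sketch of such a handler, assuming boto3 in the Lambda runtime and the standard S3 event payload; the per-member gzip-and-reupload flow mirrors the description above and is illustrative rather than the tutorial's exact code:
import gzip
import io
import zipfile

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an S3 "object created" event for a new .zip upload
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # may need URL-decoding for special characters

        # Read the whole zip archive into memory
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            for name in archive.namelist():
                # Gzip each member and upload it next to the original archive
                compressed = gzip.compress(archive.read(name))
                s3.put_object(Bucket=bucket, Key=key + "/" + name + ".gz", Body=compressed)

        # Optionally remove the original zip once everything is re-uploaded
        s3.delete_object(Bucket=bucket, Key=key)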

Use flume to stream data to S3

I am trying Flume for something very simple: I would like to push content from my log files to S3. I was able to create a Flume agent that reads the content from an Apache access log file and uses a logger sink. Now I am trying to find a solution where I can replace the logger sink with an "S3 sink" (I know this does not exist by default).
I am looking for some pointers to direct me down the correct path. Below is the test properties file that I am currently using.
a1.sources=src1
a1.sinks=sink1
a1.channels=ch1
#source configuration
a1.sources.src1.type=exec
a1.sources.src1.command=tail -f /var/log/apache2/access.log
#sink configuration
a1.sinks.sink1.type=logger
#channel configuration
a1.channels.ch1.type=memory
a1.channels.ch1.capacity=1000
a1.channels.ch1.transactionCapacity=100
#links
a1.sources.src1.channels=ch1
a1.sinks.sink1.channel=ch1
You can write to S3 with the HDFS sink, since Hadoop's filesystem layer also supports the s3n:// scheme; you just replace the hdfs path with your bucket in the following way. Don't forget to replace AWS_ACCESS_KEY and AWS_SECRET_KEY.
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>#<bucket.name>/prefix/
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.rollCount = 0
agent.sinks.s3hdfs.hdfs.rollSize = 67108864 #64Mb filesize
agent.sinks.s3hdfs.hdfs.batchSize = 10000
agent.sinks.s3hdfs.hdfs.rollInterval = 0
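As a hedged aside (a sketch, not part of the original answer): instead of embedding the keys in the URL, the s3n filesystem can also read them from a core-site.xml on the Flume agent's classpath, using Hadoop's standard property names, in which case the path shortens to s3n://<bucket.name>/prefix/:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>AWS_SECRET_KEY</value>
</property>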
The HDFS-sink approach makes sense, but can a rollSize of this value be accompanied by the following round settings?
agent_messaging.sinks.AWSS3.hdfs.round = true
agent_messaging.sinks.AWSS3.hdfs.roundValue = 30
agent_messaging.sinks.AWSS3.hdfs.roundUnit = minute