Can the Apache Flume HDFS sink accept a dynamic path to write to?

I am new to Apache Flume.
I am trying to see how I can receive JSON (via an HTTP source), parse it, and store it to a dynamic path on HDFS according to its content.
For example:
if the JSON is:
[{
"field1" : "value1",
"field2" : "value2"
}]
then the HDFS path will be:
/some-default-root-path/value1/value2/some-value-name-file
Is there a Flume configuration that enables me to do that?
Here is my current configuration (it accepts JSON via HTTP and stores it in a path based on the timestamp):
#flume.conf: http source, hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 9000
#a1.sources.r1.handler = org.apache.flume.http.JSONHandler
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/uri/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Thanks!

The solution was in the Flume documentation for the HDFS sink: event headers can be referenced in hdfs.path with the %{header} escape sequence.
Here is the revised configuration:
#flume.conf: http source, hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 9000
#a1.sources.r1.handler = org.apache.flume.http.JSONHandler
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/uri/events/%{field1}
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
and the curl:
curl -X POST -d '[{ "headers" : { "timestamp" : "434324343", "host" :"random_host.example.com", "field1" : "val1" }, "body" : "random_body" }]' localhost:9000
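For the directory layout from the original question (one level per field), multiple headers can be combined in the path. A minimal sketch, assuming the default JSONHandler is used so that the posted headers become Flume event headers:
a1.sinks.k1.hdfs.path = /some-default-root-path/%{field1}/%{field2}
and the matching curl:
curl -X POST -d '[{ "headers" : { "field1" : "value1", "field2" : "value2" }, "body" : "random_body" }]' localhost:9000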

Related

Flume not writing correctly in amazon s3 (weird characters)

My flume config:
agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = true
agent.channels.channel.type = memory
agent.channels.channel.capacity = 100
Now I add a file to the /flume_to_aws folder with the following content (plain text):
Oracle and SQL Server
After it was uploaded to S3, I downloaded the file and opened it, and it shows the following:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable
Œúg ÊC•ý¤ïM·T.C ! †"û­þ Oracle and SQL ServerÿÿÿÿŒúg ÊC•ý¤ïM·T.C
Why is the file not uploaded with only the text "Oracle and SQL Server"?
Problem solved. I found this question on Stack Overflow here.
Flume was generating files in a binary format (Hadoop SequenceFiles) instead of text format.
So I added the following lines:
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
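For reference, the revised sink section then looks like this (just the sink lines from the config above, with the two new settings added):
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sinks.s3hdfs.channel = channel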

Confluent Generalized S3 Source connector (silent failure)

I am running into a strange issue with the Generalized S3 Source connector running on Confluent Platform. I am not able to pinpoint where exactly the error is or what the root cause is.
The only error I see in the SSH console is this (related to logging):
[2023-02-11 11:12:45,464] INFO [Worker clientId=connect-1, groupId=connect-cluster-1] Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1709)
log4j:ERROR A "io.confluent.log4j.redactor.RedactorAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [PluginClassLoader{pluginLocation=file:/usr/share/java/source-2.5.1/}] whereas object of type
log4j:ERROR "io.confluent.log4j.redactor.RedactorAppender" was loaded by [jdk.internal.loader.ClassLoaders$AppClassLoader#251a69d7].
log4j:ERROR Could not instantiate appender named "redactor".
[2023-02-11 11:13:35,741] INFO Injecting Confluent license properties into connector '<unspecified>' (org.apache.kafka.connect.runtime.WorkerConfigDecorator:412)
[2023-02-11 11:13:44,001] INFO Injecting Confluent license properties into connector 'S3GenConnectorConnector_7' (org.apache.kafka.connect.runtime.WorkerConfigDecorator:412)
[2023-02-11 11:13:44,006] INFO S3SourceConnectorConfig values:
aws.access.key.id = <<ACCESS KEY HERE>>
aws.secret.access.key = [hidden]
behavior.on.error = fail
bucket.listing.max.objects.threshold = -1
confluent.license = [hidden]
confluent.topic = _confluent-command
confluent.topic.bootstrap.servers = [172.27.157.66:9092]
confluent.topic.replication.factor = 3
directory.delim = /
file.discovery.starting.timestamp = 0
filename.regex = (.+)\+(\d+)\+.+$
folders = []
format.bytearray.extension = .bin
format.bytearray.separator =
format.class = class io.confluent.connect.s3.format.string.StringFormat
format.json.schema.enable = false
mode = RESTORE_BACKUP
parse.error.topic.prefix = error
partition.field.name = []
partitioner.class = class io.confluent.connect.storage.partitioner.DefaultPartitioner
path.format =
record.batch.max.size = 200
s3.bucket.name = mytestbucketamtk
s3.credentials.provider.class = class com.amazonaws.auth.DefaultAWSCredentialsProviderChain
s3.http.send.expect.continue = true
s3.part.retries = 3
s3.path.style.access = true
s3.poll.interval.ms = 60000
s3.proxy.password = null
s3.proxy.url =
s3.proxy.username = null
s3.region = us-east-1
s3.retry.backoff.ms = 200
s3.sse.customer.key = null
s3.ssea.name =
s3.wan.mode = false
schema.cache.size = 50
store.url = null
task.batch.size = 10
topic.regex.list = [first_topic:.*]
topics.dir = topics
(io.confluent.connect.s3.source.S3SourceConnectorConfig:376)
[2023-02-11 11:13:44,029] INFO Using configured AWS access key credentials instead of configured credentials provider class. (io.confluent.connect.s3.source.S3Storage:500)
The connector config file is below:
{
  "name": "S3GenConnectorConnector_7",
  "config": {
    "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
    "tasks.max": "1",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "mode": "RESTORE_BACKUP",
    "format.class": "io.confluent.connect.s3.format.string.StringFormat",
    "s3.bucket.name": "mytestbucketamtk",
    "s3.region": "us-east-1",
    "aws.access.key.id": <<ACCESS KEY HERE>>,
    "aws.secret.access.key": <<SECRET KEY>>,
    "topic.regex.list": "first_topic:.*"
  }
}
The tasks are not getting created, and there are no other errors in the Connect console. The Connect cluster is running on Confluent Platform. Any pointers in the right direction would be appreciated. Did I miss any required configuration?
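One way to confirm whether any task was created at all is the Kafka Connect REST API, which reports connector and task state. A quick sketch (port 8083 is the Connect REST default and may differ in this setup):
curl -s http://localhost:8083/connectors/S3GenConnectorConnector_7/status
curl -s http://localhost:8083/connectors/S3GenConnectorConnector_7/tasks
If the status call shows the connector RUNNING but an empty tasks list, the failure happened while creating tasks rather than while starting the connector.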

Apache Flume with 2 different interceptors on same source

I am trying to add 2 different interceptors on the same source and send the intercepted data to 2 different channels.
But I was not able to configure this, and I couldn't find any documentation about it. I am also having some issues with the channel selector: I am not sure how to select a channel based on the different interceptors.
Here is my configuration so far:
a1.sources = syslog_udp
a1.channels = chan1 chan2
a1.sinks = sink1 sink2
# both are different Kafka sinks
a1.sources.syslog_udp.type = syslogudp
a1.sources.syslog_udp.port = 514
a1.sources.syslog_udp.host = 0.0.0.0
a1.sources.syslog_udp.keepFields = true
a1.sources.syslog_udp.interceptors = i1 i2
a1.sources.syslog_udp.interceptors.i1.type = regex_filter
a1.sources.syslog_udp.interceptors.i1.regex = '<regex_string1>'
a1.sources.syslog_udp.interceptors.i1.excludeEvents = false
a1.sources.syslog_udp.interceptors.i2.type = regex_filter
a1.sources.syslog_udp.interceptors.i2.regex = '<regex_string1>'|'<regex_string2>'
a1.sources.syslog_udp.interceptors.i2.excludeEvents = false
a1.sources.syslog_udp.selector.type = multiplexing
a1.sources.syslog_udp.channels = chan1 chan2
a1.channels.chan1.type = memory
a1.channels.chan1.capacity = 200
a1.channels.chan2.type = memory
a1.channels.chan2.capacity = 200
It seems there is no straightforward setup for this.
A workaround for this kind of layout is to keep a single, wider interceptor in one agent, pipe its output to an Avro sink, and set up a second agent with an Avro source, applying the second interceptor there.
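A rough sketch of that workaround, assuming a second agent a2 on the same host and an arbitrary Avro port 4545 (channel and Kafka sink details omitted; the names to_a2 and from_a1 are made up here):
# agent a1: syslog source, the wider regex filter, Avro sink towards a2
a1.sources.syslog_udp.interceptors = i2
a1.sinks.to_a2.type = avro
a1.sinks.to_a2.hostname = localhost
a1.sinks.to_a2.port = 4545
a1.sinks.to_a2.channel = chan1
# agent a2: Avro source, the narrower regex filter, then the Kafka sink
a2.sources.from_a1.type = avro
a2.sources.from_a1.bind = 0.0.0.0
a2.sources.from_a1.port = 4545
a2.sources.from_a1.interceptors = i1
a2.sources.from_a1.interceptors.i1.type = regex_filter
a2.sources.from_a1.interceptors.i1.regex = '<regex_string1>'
a2.sources.from_a1.channels = chan2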

telegraf disk-input does not write to output in phusion/baseimage

I have just set up Telegraf and InfluxDB with some other plugins.
But the output of [[inputs.disk]] is not sent to the Influx database, although the Telegraf CLI prints the series:
root@99a3dda91f0e:/# telegraf --config /etc/telegraf/telegraf.conf --test
* Plugin: inputs.disk, Collection 1
> disk,path=/,device=none,fstype=aufs,host=99a3dda91f0e,dockerhost=0zizhqemxr3fmhr949qqg94ly free=92858503168i,used=5304786944i,used_percent=5.404043546164225,inodes_total=6422528i,inodes_free=6192593i,inodes_used=229935i,total=103441399808i 1504273867000000000
> disk,path=/usr/share/zoneinfo/Etc/UTC,device=sda1,fstype=ext4,host=99a3dda91f0e,dockerhost=0zizhqemxr3fmhr949qqg94ly used=5304786944i,used_percent=5.404043546164225,inodes_total=6422528i,inodes_free=6192593i,inodes_used=229935i,total=103441399808i,free=92858503168i 1504273867000000000
> disk,path=/etc/resolv.conf,device=sda1,fstype=ext4,host=99a3dda91f0e,dockerhost=0zizhqemxr3fmhr949qqg94ly inodes_free=253014i,inodes_used=729i,total=207867904i,free=191041536i,used=16826368i,used_percent=8.094740783069618,inodes_total=253743i 1504273867000000000
> disk,path=/etc/hostname,device=sda1,fstype=ext4,host=99a3dda91f0e,dockerhost=0zizhqemxr3fmhr949qqg94ly total=103441399808i,free=92858503168i,used=5304786944i,used_percent=5.404043546164225,inodes_total=6422528i,inodes_free=6192593i,inodes_used=229935i 1504273867000000000
> disk,dockerhost=0zizhqemxr3fmhr949qqg94ly,path=/etc/hosts,device=sda1,fstype=ext4,host=99a3dda91f0e total=103441399808i,free=92858503168i,used=5304786944i,used_percent=5.404043546164225,inodes_total=6422528i,inodes_free=6192593i,inodes_used=229935i 1504273867000000000
* Plugin: inputs.kernel, Collection 1
> kernel,host=99a3dda91f0e,dockerhost=0zizhqemxr3fmhr949qqg94ly interrupts=38110293i,context_switches=66702050i,boot_time=1504190750i,processes_forked=227872i 1504273867000000000
Within influx:
> use monitoring
Using database monitoring
> show measurements
name: measurements
name
----
kernel
>
the telegraf config:
[global_tags]
host = "$HOSTNAME"
dockerhost = "$DOCKERHOSTNAME"
# Configuration for telegraf agent
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = true
logfile = ""
hostname = ""
omit_hostname = false
[[outputs.influxdb]]
urls = ["http://influxdb:8086"] # required
database = "$INFLUX_DATABASE"
retention_policy = ""
write_consistency = "any"
timeout = "5s"
[[inputs.disk]]
## Setting mountpoints will restrict the stats to the specified mountpoints.
# mount_points = ["/"]
## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
## present on /run, /var/run, /dev/shm or /dev).
ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
[[inputs.kernel]]
Telegraf v1.3.5 (git: release-1.3 7192e68b2423997177692834f53cdf171aee1a88)
InfluxDB v1.3.2 (git: 1.3 742b9cb3d74ff1be4aff45d69ee7c9ba66c02565)
Edit: of course:
echo $INFLUX_DATABASE
monitoring
If I add other inputs again, like [[inputs.diskio]], they appear in the database immediately.
It seems there is an issue with the [[inputs.disk]] collection when running Telegraf as a runsv script in phusion/baseimage.

Unable to ingest data from flume to hdfs hadoop for logs

I am using the following configuration for pushing data from a log file to HDFS.
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity=5000
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /home/training/Downloads/log.txt
agent.sources.tail-source.channels = memory-channel
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.batchSize=10
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/user/flume/data/log.txt
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.writeFormat = Text
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
I got no error message, but I am still not able to find the output in HDFS.
On interrupting the agent, I can see a sink interruption exception and some data from that log file.
I am running the following command:
flume-ng agent --conf /etc/flume-ng/conf/ --conf-file /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent;
I had a similar issue; in my case it is now working. Below is the conf file:
#Exec Source
execAgent.sources=e
execAgent.channels=memchannel
execAgent.sinks=HDFS
#channels
execAgent.channels.memchannel.type=file
execAgent.channels.memchannel.capacity = 20000
execAgent.channels.memchannel.transactionCapacity = 1000
#Define Source
execAgent.sources.e.type=org.apache.flume.source.ExecSource
execAgent.sources.e.channels=memchannel
execAgent.sources.e.shell=/bin/bash -c
execAgent.sources.e.fileHeader=false
execAgent.sources.e.fileSuffix=.txt
execAgent.sources.e.command=cat /home/sample.txt
#Define Sink
execAgent.sinks.HDFS.type=hdfs
execAgent.sinks.HDFS.hdfs.path=hdfs://localhost:8020/user/flume/
execAgent.sinks.HDFS.hdfs.fileType=DataStream
execAgent.sinks.HDFS.hdfs.writeFormat=Text
execAgent.sinks.HDFS.hdfs.batchSize=1000
execAgent.sinks.HDFS.hdfs.rollSize=268435
execAgent.sinks.HDFS.hdfs.rollInterval=0
#Bind Source Sink Channel
execAgent.sources.e.channels=memchannel
execAgent.sinks.HDFS.channel=memchannel
I suggest using the prefix configuration when placing files in HDFS:
agent.sinks.hdfs-sink.hdfs.filePrefix = log.out
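To check whether the sink is rolling any files at all, a recursive listing of the target directory is a quick sanity test (a sketch using the paths from the configurations above):
hdfs dfs -ls -R /user/flume/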
@bhavesh - Are you sure the log file (agent.sources.tail-source.command = tail -F /home/training/Downloads/log.txt) keeps appending data? Since you have used a tail command with -F, only newly appended data (within the file) will be dumped into HDFS.