Unable to ingest data from Flume to HDFS (Hadoop) for logs

I am using the following configuration for pushing data from a log file to HDFS.
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity=5000
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /home/training/Downloads/log.txt
agent.sources.tail-source.channels = memory-channel
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.batchSize=10
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/user/flume/data/log.txt
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.writeFormat = Text
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
I get no error message, but I still cannot find any output in HDFS.
On interrupting the agent, I can see a sink interruption exception and some data from that log file.
I am running the following command:
flume-ng agent --conf /etc/flume-ng/conf/ --conf-file /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent;

I had a similar issue. In my case it is working now; below is the conf file:
#Exec Source
execAgent.sources=e
execAgent.channels=memchannel
execAgent.sinks=HDFS
#channels
execAgent.channels.memchannel.type=file
execAgent.channels.memchannel.capacity = 20000
execAgent.channels.memchannel.transactionCapacity = 1000
#Define Source
execAgent.sources.e.type=org.apache.flume.source.ExecSource
execAgent.sources.e.channels=memchannel
execAgent.sources.e.shell=/bin/bash -c
execAgent.sources.e.fileHeader=false
execAgent.sources.e.fileSuffix=.txt
execAgent.sources.e.command=cat /home/sample.txt
#Define Sink
execAgent.sinks.HDFS.type=hdfs
execAgent.sinks.HDFS.hdfs.path=hdfs://localhost:8020/user/flume/
execAgent.sinks.HDFS.hdfs.fileType=DataStream
execAgent.sinks.HDFS.hdfs.writeFormat=Text
execAgent.sinks.HDFS.hdfs.batchSize=1000
execAgent.sinks.HDFS.hdfs.rollSize=268435
execAgent.sinks.HDFS.hdfs.rollInterval=0
#Bind Source Sink Channel
execAgent.sources.e.channels=memchannel
execAgent.sinks.HDFS.channel=memchannel
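To start this agent, the command follows the same pattern as the one in the question, only with the agent name set to execAgent (a sketch; the conf file path /etc/flume-ng/conf/flume.conf is assumed):
flume-ng agent --conf /etc/flume-ng/conf/ --conf-file /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n execAgent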

I suggest using the prefix configuration when placing files in HDFS:
agent.sinks.hdfs-sink.hdfs.filePrefix = log.out
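For illustration, here is how that prefix might fit into the asker's original sink block, together with explicit roll settings (a minimal sketch: the property names are standard HDFS sink properties, the values are only examples, hdfs.path is treated as a directory, and the in-progress file keeps a .tmp suffix until it is rolled, so output may not be visible before the first roll):
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/user/flume/data
agent.sinks.hdfs-sink.hdfs.filePrefix = log.out
agent.sinks.hdfs-sink.hdfs.rollInterval = 30
agent.sinks.hdfs-sink.hdfs.rollCount = 10
agent.sinks.hdfs-sink.hdfs.rollSize = 0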

@bhavesh - Are you sure the log file (agent.sources.tail-source.command = tail -F /home/training/Downloads/log.txt) keeps appending data? Since you have used tail with -F, only newly appended data (within the file) will be dumped into HDFS.
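One simple way to check that (a sketch using the paths from the question) is to append a line to the tailed file while the agent is running and then list the target directory in HDFS:
echo "another test line" >> /home/training/Downloads/log.txt
hdfs dfs -ls /user/flume/data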

Related

Flume not writing correctly to Amazon S3 (weird characters)

My flume config:
agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = true
agent.channels.channel.type = memory
agent.channels.channel.capacity = 100
Now I add a file to the /flume_to_aws folder with the following content (text):
Oracle and SQL Server
After it was uploaded to S3, I downloaded the file, opened it, and it showed the following text:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable
Œúg ÊC•ý¤ïM·T.C ! †"û­þ Oracle and SQL ServerÿÿÿÿŒúg ÊC•ý¤ïM·T.C
Why is the file not uploaded with only the text "Oracle and SQL Server"?
Problem solved. I found this question answered on Stack Overflow:
Flume was generating files in binary format instead of text format.
So I added the following lines:
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
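For context, a likely explanation (based on the standard HDFS sink defaults): hdfs.fileType defaults to SequenceFile, which is why the downloaded file starts with the SEQ / LongWritable / BytesWritable header; with the two lines above, the sink writes the event bodies as plain text instead. The complete revised sink block would then look roughly like this:
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream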

Create data source for Weblogic 12.2.1.3 in offline mode

I want to create a WebLogic data source using WLST in offline mode, and I get an error when I try to create Properties:
create('my_Prop','Properties')
Here is the entire script:
readDomain('C:\\weblogic12213\\user_projects\\domains\\myDomain')
cd('/')
create('myJDBC', 'JDBCSystemResource')
cd('/JDBCSystemResource/myJDBC')
set('Target','myApp')
cd('/JDBCSystemResource/myJDBC/JdbcResource/myJDBC')
cmo.setName('myJDBC')
create('myJDBC','JDBCDataSourceParams')
cd('JDBCDataSourceParams/myJDBC')
set('JNDIName', java.lang.String('jdbc.myJDBC'))
set('GlobalTransactionsProtocol', java.lang.String('OnePhaseCommit'))
cd('/JDBCSystemResource/myJDBC/JdbcResource/myJDBC')
create('myJDBC','JDBCDriverParams')
cd('JDBCDriverParams/myJDBC')
set('DriverName','weblogic.jdbc.sqlserver.SQLServerDriver')
set('URL','jdbc:weblogic:sqlserver://localhost:1433;allowPortWithNamedInstance=true')
set('PasswordEncrypted', 'myPassword')
set('UseXADataSourceInterface', 'false')
create('my_Prop','Properties')
cd('Properties/myJDBC')
create('user','Property')
cd('Property/user')
set('Value', 'myUser')
cd('/JDBCSystemResource/myJDBC/JdbcResource/myJDBC')
create('myJDBC','JDBCConnectionPoolParams')
cd('JDBCConnectionPoolParams/myJDBC')
set('TestTableName','SQL SELECT 1')
updateDomain()
closeDomain()
exit()
This error appears:
com.oracle.cie.domain.script.jython.WLSTException: Could not create generic operation:Properties
at com.oracle.cie.domain.operation.OperationBuilder.createConfigOperation(OperationBuilder.java:342)
at com.oracle.cie.domain.script.jython.CommandExceptionHandler.handleException(CommandExceptionHandler.java:69)
at com.oracle.cie.domain.script.jython.WLScriptContext.handleException(WLScriptContext.java:2983)
Does anybody have any idea please?
I suppose you already found the solution. This works for me without errors.
# cd into the already created driver params
cd('/JDBCSystemResource/myJDBC/JdbcResource/myJDBC/JDBCDriverParams/NO_NAME_0')
create('properties','Properties')
cd('Properties/NO_NAME_0')
create('property','Property')
cd('Property/property')
set("Key", "key")
set("Value", "value")
"""
This script configures a JDBC data source as a System Module and deploys it
to the server
"""
import sys
url='t3://' + sys.argv[1] + ':' + sys.argv[2]
username = sys.argv[3]
password = sys.argv[4]
connect(username,password,url)
edit()
# Change these names as necessary
dsname="myJDBCDataSource"
server=sys.argv[5]
cd("Servers/"+server)
target=cmo
cd("../..")
startEdit()
# start creation
print 'Creating JDBCSystemResource with name '+dsname
jdbcSR = create(dsname,"JDBCSystemResource")
theJDBCResource = jdbcSR.getJDBCResource()
theJDBCResource.setName("myJDBCDataSource")
connectionPoolParams = theJDBCResource.getJDBCConnectionPoolParams()
connectionPoolParams.setConnectionReserveTimeoutSeconds(25)
connectionPoolParams.setMaxCapacity(100)
connectionPoolParams.setTestTableName("SYSTABLES")
dsParams = theJDBCResource.getJDBCDataSourceParams()
dsParams.addJNDIName("ds.myJDBCDataSource")
driverParams = theJDBCResource.getJDBCDriverParams()
driverParams.setUrl("jdbc:derby://localhost:1527/examples;create=true")
driverParams.setDriverName("org.apache.derby.jdbc.ClientXADataSource")
# driverParams.setUrl("jdbc:oracle:thin:#my-oracle-server:my-oracle-server-port:my-oracle-sid")
# driverParams.setDriverName("oracle.jdbc.driver.OracleDriver")
driverParams.setPassword("examples")
# driverParams.setLoginDelaySeconds(60)
driverProperties = driverParams.getProperties()
proper = driverProperties.createProperty("user")
#proper.setName("user")
proper.setValue("examples")
proper1 = driverProperties.createProperty("DatabaseName")
#proper1.setName("DatabaseName")
proper1.setValue("examples")
jdbcSR.addTarget(target)
save()
activate(block="true")
print 'Done configuring the data source'

How to read specific numbers out of a file

I want to write a Lua script which will save my vars and load them back into my program. I searched a bit on the Internet for code examples and now I have this:
--SetUp vars
accept = 1
strenght = 5
hp = 2
--create file
local f = assert(io.open("quicksave", "w"))
f:write(accept, "\n")
f:write(strenght, "\n")
f:write(hp, "\n")
f:close()
--Set vars to 0(simulate restart of program)
accept = 0
strenght = 0
hp = 0
print("accept: "..accept.." Strenght: "..strenght.." HP: "..hp)
--load in the saved vars
local f = assert(io.open("quicksave", "r"))
accept = f:read("*line")
strenght = f:read("*line")
hp = f:read("*line")
f:close()
print("accept: "..accept.." Strenght: "..strenght.." HP: "..hp)
This works fine for me, but how can I read only specific values from the file? For example, what should I do if I want to read out only the second line of the file (the variable for strength)?
You can simply read and discard the first line:
--load in the second saved var
local f = assert(io.open("quicksave", "r"))
f:read("*line")
strenght = f:read("*line")
Nevertheless, I suggest you save your data as a Lua script that can be loaded with dofile. Something like:
return {
accept = 1,
strenght = 5,
hp = 2
}
Then you can load it into a local variable and read the fields you need:
local state = dofile("state.lua")
strenght = state.strenght

How to set the log filename in Flume

I am using Apache Flume for log collection. This is my config file:
httpagent.sources = http-source
httpagent.sinks = local-file-sink
httpagent.channels = ch3
#Define source properties
httpagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
httpagent.sources.http-source.channels = ch3
httpagent.sources.http-source.port = 8082
# Local File Sink
httpagent.sinks.local-file-sink.type = file_roll
httpagent.sinks.local-file-sink.channel = ch3
httpagent.sinks.local-file-sink.sink.directory = /home/avinash/log_dir
httpagent.sinks.local-file-sink.sink.rollInterval = 21600
# Channels
httpagent.channels.ch3.type = memory
httpagent.channels.ch3.capacity = 1000
My application is working fine. My problem is that the files in log_dir are named with a random number (I guess it is a timestamp) by default.
How can I give the log files a proper filename suffix?
Having a look at the documentation, it seems there is no parameter for configuring the name of the files that are going to be created. I've gone through the sources looking for some hidden parameter, but there is none :)
Going into the details of the implementation, it seems the name of the file is managed by the PathManager class:
private PathManager pathController;
...
@Override
public Status process() throws EventDeliveryException {
...
if (outputStream == null) {
File currentFile = pathController.getCurrentFile();
logger.debug("Opening output stream for file {}", currentFile);
try {
outputStream = new BufferedOutputStream(new FileOutputStream(currentFile));
...
}
Which, as you already noticed, is based on the current timestamp (showing the constructor and the next file getter):
public PathManager() {
seriesTimestamp = System.currentTimeMillis();
fileIndex = new AtomicInteger();
}
public File nextFile() {
currentFile = new File(baseDirectory, seriesTimestamp + "-" + fileIndex.incrementAndGet());
return currentFile;
}
So, I think the only possibility you have is to extend the File Roll sink and override the process() method in order to use a custom path controller.
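As a possible alternative to subclassing (worth verifying against the Flume version in use, since I am recalling this from the user guide rather than from this answer), newer releases of the file_roll sink expose pathManager properties that control the prefix and extension of the generated files, roughly along these lines:
httpagent.sinks.local-file-sink.sink.pathManager = DEFAULT
httpagent.sinks.local-file-sink.sink.pathManager.prefix = accesslog-
httpagent.sinks.local-file-sink.sink.pathManager.extension = log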
For sources, you can execute commands to tail and prepend or append details based on shell scripting. Below is a sample:
# Describe/configure the source for tailing file
httpagent.sources.source.type = exec
httpagent.sources.source.shell = /bin/bash -c
httpagent.sources.source.command = tail -F /path/logs/*_details.log
httpagent.sources.source.restart = true
httpagent.sources.source.restartThrottle = 1000
httpagent.sources.source.logStdErr = true
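Since the exec source here goes through /bin/bash -c, the command can also be a small pipeline that tags each line before it reaches the channel. A hypothetical example (the log path and the prefix text are made up for illustration; sed -u keeps the output line-buffered so events are not held back):
httpagent.sources.source.command = tail -F /path/logs/app_details.log | sed -u -e 's/^/app_details: /'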

Can the Apache Flume HDFS sink accept a dynamic path to write to?

I am new to Apache Flume.
I am trying to see how I can receive JSON (via an HTTP source), parse it, and store it to a dynamic path on HDFS according to its content.
For example:
if the json is:
[{
"field1" : "value1",
"field2" : "value2"
}]
then the hdfs path will be:
/some-default-root-path/value1/value2/some-value-name-file
Is there a Flume configuration that enables me to do that?
Here is my current configuration (it accepts JSON via HTTP and stores it in a path based on the timestamp):
#flume.conf: http source, hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 9000
#a1.sources.r1.handler = org.apache.flume.http.JSONHandler
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/uri/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Thanks!
The solution was in the Flume documentation for the HDFS sink: the hdfs.path value supports escape sequences such as %{header_name}, which are replaced with the value of the corresponding event header.
Here is the revised configuration:
#flume.conf: http source, hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 9000
#a1.sources.r1.handler = org.apache.flume.http.JSONHandler
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/uri/events/%{field1}
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
and the curl:
curl -X POST -d '[{ "headers" : { "timestamp" : "434324343", "host" :"random_host.example.com", "field1" : "val1" }, "body" : "random_body" }]' localhost:9000
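To get the two-level layout asked about originally (/some-default-root-path/value1/value2/...), the same mechanism extends naturally, as long as both values are sent as event headers. A sketch reusing the field names from the question:
a1.sinks.k1.hdfs.path = /some-default-root-path/%{field1}/%{field2}
curl -X POST -d '[{ "headers" : { "timestamp" : "434324343", "host" : "random_host.example.com", "field1" : "value1", "field2" : "value2" }, "body" : "random_body" }]' localhost:9000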