I am working on HDP (Hortonworks Data Platform), trying to collect tweets through Flume and to query the stored data from Hive.
The problem is that select * from tweetsavro limit 1; works, but select * from tweetsavro limit 2; fails with:
Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
What I did follows this answer, namely:
twitter.conf
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxx
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://sandbox.hortonworks.com:8020/user/flume/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.serializer = Text
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
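For reference, an agent with this configuration would typically be started with a command along the following lines (the conf directory and file location are assumptions for this environment):
flume-ng agent --conf /etc/flume/conf --conf-file twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console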
twitter.avsc is created by the following command.
java -jar avro-tools-1.7.7.jar getschema FlumeData.1503479843633 > twitter.avsc
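As a side note, the same jar can be used to check whether a given FlumeData file is itself valid Avro by dumping a record as JSON; if the file is corrupted, this command should fail as well:
java -jar avro-tools-1.7.7.jar tojson FlumeData.1503479843633 | head -n 1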
I created a table by
CREATE TABLE tweetsavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs://sandbox.hortonworks.com:8020/user/flume/twitter.avsc') ;
LOAD DATA INPATH 'hdfs://sandbox.hortonworks.com:8020/user/flume/twitter_data/FlumeData.*' OVERWRITE INTO TABLE tweetsavro;
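For reference, a quick way to confirm that Hive picked up the schema from avro.schema.url is to describe the table; the columns should match the fields in twitter.avsc:
DESCRIBE tweetsavro;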
Remarks:
I tried an external table (instead of a managed one), but the situation did not change.
Because I use Hortonworks, I do not use Cloudera's TwitterSource.
Add this to your configuration file:
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000
My flume config:
agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = true
agent.channels.channel.type = memory
agent.channels.channel.capacity = 100
Now I add a file to the /flume_to_aws folder with the following text content:
Oracle and SQL Server
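For example (hypothetical file name), the file could be dropped into the spool directory like this:
echo "Oracle and SQL Server" > /flume_to_aws/test01.txt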
After it was uploaded to S3, I downloaded the file and opened it, and it shows the following:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable
Œúg ÊC•ý¤ïM·T.C ! †"ûþ Oracle and SQL ServerÿÿÿÿŒúg ÊC•ý¤ïM·T.C
Why is the file not uploaded with only the text "Oracle and SQL Server"?
Problem solved. I found this question on Stack Overflow here.
Flume is generating files in binary format instead of text format.
So I added the following lines:
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
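For reference, the full sink section with these two properties added looks like this (everything else in the config above stays the same):
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sinks.s3hdfs.channel = channel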
I have to read a file using this file format:
CREATE OR REPLACE FILE FORMAT FF_CSV
TYPE = CSV
COMPRESSION = GZIP
RECORD_DELIMITER = '\n'
FIELD_DELIMITER = 'µµµ'
FILE_EXTENSION = 'csv'
SKIP_HEADER = 0
SKIP_BLANK_LINES = TRUE
DATE_FORMAT = AUTO
TIME_FORMAT = AUTO
TIMESTAMP_FORMAT = AUTO
BINARY_FORMAT = UTF8
ESCAPE = NONE --may need to set to '<character>'
ESCAPE_UNENCLOSED_FIELD = NONE --may need to set to '<character>'
TRIM_SPACE = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"' --may need to set to '<character>'
NULL_IF = '' --( '<string>' [ , '<string>' ... ] )
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
REPLACE_INVALID_CHARACTERS = TRUE
VALIDATE_UTF8 = TRUE
EMPTY_FIELD_AS_NULL = TRUE
SKIP_BYTE_ORDER_MARK = FALSE
ENCODING = UTF8
;
For now I just want to test whether this definition works correctly with my file, but I am not sure how to test it. This is how I upload the file to my Snowflake stage:
put file://Users/myname/Desktop/leg.csv @~
Now, how can I use the file format FF_CSV in a SELECT statement so that I can read my uploaded file using that format?
You would use a COPY INTO statement to pull the data from the stage into a table.
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
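A minimal sketch of such a statement, assuming a target table named LEG_DATA already exists with columns matching the file, and that the PUT above placed the file in your user stage (PUT gzip-compresses it to leg.csv.gz by default):
COPY INTO LEG_DATA              -- hypothetical target table
  FROM @~/leg.csv.gz            -- file uploaded to the user stage by PUT
  FILE_FORMAT = (FORMAT_NAME = 'FF_CSV');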
I would also recommend checking out some of these pages about loading data into Snowflake.
https://docs.snowflake.com/en/user-guide-data-load.html
As part of the Snowflake WebUI Essentials course, I'm trying to load data from 'WEIGHT.TXT' on AWS S3 bucket into a Snowflake DB table.
select * from weight_ingest
> Result: 0 rows
list @S3TESTBKT/W
> Result:1
> s3://my-s3-tstbkt/WEIGHT.txt 509814 6e66e0c954a0dfe2c5d9638004a98912 Tue, 17 Dec 2019 14:52:52 GMT
COPY INTO WEIGHT_INGEST
FROM @S3TESTBKT/W
FILES = 'WEIGHT.TXT'
FILE_FORMAT = (FORMAT_NAME=USDA_FILE_FORMAT)
> Result: Copy executed with 0 files processed.
Can someone please help me resolve this? Thanks in advance.
Further Information:
S3 Object URL: https://my-s3-tstbkt.s3.amazonaws.com/WEIGHT.txt (I'm able to open the file contents in a browser)
Path to file: s3://my-s3-tstbkt/WEIGHT.txt
File Format Definition:
ALTER FILE FORMAT "USDA_NUTRIENT_STDREF"."PUBLIC".USDA_FILE_FORMAT
SET COMPRESSION = 'AUTO'
FIELD_DELIMITER = '^'
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
Stage Definition:
ALTER STAGE "USDA_NUTRIENT_STDREF"."PUBLIC"."S3TESTBKT"
SET URL = 's3://my-s3-tstbkt';
I believe the issue is with your COPY command. Try the following steps:
Execute the LIST command to get the list of files:
LIST @S3TESTBKT;
If your source file appears here, make sure the folder name and the exact file name (including case: WEIGHT.txt, not WEIGHT.TXT) are correct in your COPY command:
COPY INTO WEIGHT_INGEST
FROM @S3TESTBKT/
FILES = ('WEIGHT.txt')
FILE_FORMAT = (FORMAT_NAME = USDA_FILE_FORMAT);
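Once the COPY reports the file as processed, re-checking the row count should confirm the load:
SELECT COUNT(*) FROM WEIGHT_INGEST;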
I am using the following configuration to push data from a log file to HDFS.
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity=5000
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /home/training/Downloads/log.txt
agent.sources.tail-source.channels = memory-channel
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.batchSize=10
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/user/flume/data/log.txt
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.writeFormat = Text
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
I get no error message, but I still cannot find the output in HDFS.
On interrupting the agent I can see a sink interruption exception and some data from that log file.
I am running the following command:
flume-ng agent --conf /etc/flume-ng/conf/ --conf-file /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent;
I had a similar issue. In my case it is now working. Below is the conf file:
#Exec Source
execAgent.sources=e
execAgent.channels=memchannel
execAgent.sinks=HDFS
#channels
execAgent.channels.memchannel.type=file
execAgent.channels.memchannel.capacity = 20000
execAgent.channels.memchannel.transactionCapacity = 1000
#Define Source
execAgent.sources.e.type=org.apache.flume.source.ExecSource
execAgent.sources.e.channels=memchannel
execAgent.sources.e.shell=/bin/bash -c
execAgent.sources.e.fileHeader=false
execAgent.sources.e.fileSuffix=.txt
execAgent.sources.e.command=cat /home/sample.txt
#Define Sink
execAgent.sinks.HDFS.type=hdfs
execAgent.sinks.HDFS.hdfs.path=hdfs://localhost:8020/user/flume/
execAgent.sinks.HDFS.hdfs.fileType=DataStream
execAgent.sinks.HDFS.hdfs.writeFormat=Text
execAgent.sinks.HDFS.hdfs.batchSize=1000
execAgent.sinks.HDFS.hdfs.rollSize=268435
execAgent.sinks.HDFS.hdfs.rollInterval=0
#Bind Source Sink Channel
execAgent.sources.e.channels=memchannel
execAgent.sinks.HDFS.channel=memchannel
I suggest using the prefix configuration when placing files in HDFS:
agent.sinks.hdfs-sink.hdfs.filePrefix = log.out
@bhavesh - Are you sure the log file (agent.sources.tail-source.command = tail -F /home/training/Downloads/log.txt) keeps appending data? Since you have used a tail command with -F, only newly appended data (within the file) will be dumped into HDFS.
I am new to Apache Flume.
I am trying to see how I can receive JSON (via an HTTP source), parse it, and store it at a dynamic path on HDFS according to its content.
For example:
if the json is:
[{
"field1" : "value1",
"field2" : "value2"
}]
then the hdfs path will be:
/some-default-root-path/value1/value2/some-value-name-file
Is there a Flume configuration that enables me to do that?
Here is my current configuration (it accepts JSON via HTTP and stores it at a path based on the timestamp):
#flume.conf: http source, hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 9000
#a1.sources.r1.handler = org.apache.flume.http.JSONHandler
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/uri/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Thanks!
The solution was in the Flume documentation for the HDFS sink: the hdfs.path value supports escape sequences such as %{header}, which are replaced with the value of the corresponding event header.
Here is the revised configuration:
#flume.conf: http source, hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 9000
#a1.sources.r1.handler = org.apache.flume.http.JSONHandler
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/uri/events/%{field1}
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
and the curl:
curl -X POST -d '[{ "headers" : { "timestamp" : "434324343", "host" :"random_host.example.com", "field1" : "val1" }, "body" : "random_body" }]' localhost:9000
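With this event, %{field1} in hdfs.path resolves to val1, so the data should land under a path like /user/uri/events/val1/events-.<timestamp> (the exact file name depends on the sink's roll settings and timestamp).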