I am using Hadoop 2.0.4 and working on Twitter sentiment analysis. I have used Flume to ingest the data, but now the Twitter data must be stored in a Hive table.
I have created a table, but ROW FORMAT SERDE is giving the error
'Unable to validate'
Kindly tell me how to proceed.
Are you using a custom SerDe?
Please refer to the information below from the Hive Language Manual:
You can create tables with a custom SerDe or using a native SerDe. A
native SerDe is used if ROW FORMAT is not specified or ROW FORMAT
DELIMITED is specified.
Hope the information is useful.
You can try adding this jar
hive-serdes-1.0-SNAPSHOT.jar
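For example, from the Hive shell (the path below is just a placeholder for wherever the jar actually sits on your machine):
-- Path is a placeholder; point it at the actual location of the jar
ADD JAR /path/to/hive-serdes-1.0-SNAPSHOT.jar;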
After adding the jar, you can create an external Hive table containing the tweet id and the tweet text, pointing at the tweets directory, for performing sentiment analysis, like this:
create external table load_tweets(id BIGINT, text STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
You can refer to the link below for performing sentiment analysis using Hive:
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/
Check whether you have added hive-serdes-1.0-SNAPSHOT.jar to the lib folder under your Hive directory. Your Hive directory path will be the one you have mentioned in your .bashrc file.
Related
I have a JSON file and want to move only selected fields into a Hive table. Below is the statement I used to create a new table to import the data from the JSON file into the Hive table. Creating it doesn't give any error, but when I run select * from JsonFile1 or select count(*) from JsonFile1 I get the error: Failed with exception java.io.IOException: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
I have browsed the internet and have been stuck on this for a few days; I can't find a solution. I checked HDFS and see that the table was created and the complete file was imported as-is (not just the fields I selected, but all of them). I have only provided sample data; the actual data contains 50+ field names, and creating all the column names is cumbersome. Is that what we need to do? Thank you in advance.
CREATE EXTERNAL TABLE JsonFile1(user STRUCT<id:BIGINT,description:STRING, followers_count:INT>)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION 'link/data';
I have data as below
{filter_level":"low",geo":null,"user":{"id":859264394,"description":"I don’t want it. Building #techteam, #LetsTalk!!! def#abc.com",
"contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name"krogmi",
"screen_name":"jkrogmi","id_str":"859264394",}}06:20:16 +0000 2012","default_profile_image":false,"followers_count":88,
"profile_sidebar_fill_color":"DDFFCC","screen_name":"abc_abc"}}
Answering my own question.
I deleted the data in HDFS that LOCATION '...' was pointing to, copied the data from local to HDFS again, recreated the table, and it worked.
I am assuming that the data was the problem.
I am trying to locate the property avro.schema.url, which is part of the table metadata when a table is created by specifying the location of an Avro schema file for some Avro data in S3 or HDFS. I can see it in the output when I run the describe extended table command, but where is this property stored within the metastore database? I searched TABLE_PARAMS for that particular table id and did not find it.
Found it; it's in the SERDE_PARAMS table.
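For reference, a rough query against a MySQL- or PostgreSQL-backed metastore (exact table and column names can vary by metastore schema version; 'my_table' is a placeholder):
-- Walk from the table to its storage descriptor to its SerDe parameters
SELECT sp.PARAM_KEY, sp.PARAM_VALUE
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN SERDE_PARAMS sp ON s.SERDE_ID = sp.SERDE_ID
WHERE t.TBL_NAME = 'my_table'
  AND sp.PARAM_KEY = 'avro.schema.url';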
I need to build generic file ingestion into Hive. The files are very large (2GB+) and can be fixed-width or comma-separated, ASCII or EBCDIC. After trying various techniques using Talend, I am looking into SerDes. If I ingest the files as-is and use a schema file (containing ordinal position, column name, type, and length), can I create a custom SerDe to deserialize any input file into Hive rows? How performant would it be?
Since asking this question, I have found that I could use a COBOL custom SerDe.
I am also looking at the regex SerDe for positional files, as sketched below.
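As a rough sketch of the regex approach for a fixed-width file (the table name, columns, widths, and location here are made up; RegexSerDe only supports STRING columns, so type conversion happens in later queries):
CREATE EXTERNAL TABLE fixed_width_raw (
  account_id STRING,
  trade_date STRING,
  amount     STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capturing group per column, widths taken from the schema file
  "input.regex" = "(.{10})(.{8})(.{12}).*")
STORED AS TEXTFILE
LOCATION '/data/fixed_width/';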
When we create an ORC table in Hive, we can see that the data is compressed and not directly readable in HDFS. So how is Hive able to convert that compressed data into the readable format shown to us when we fire a simple select * query against that table?
Thanks for any suggestions!
By using the ORC SerDe while creating the table: you have to provide the package name of the SerDe class in the ROW FORMAT SERDE clause, as shown below.
What a SerDe does is deserialize data in a particular format into objects that Hive can process, and then serialize those objects to store the data back in HDFS.
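For example (the table name and columns here are made up; the class names are the standard ones Hive ships for ORC), STORED AS ORC is essentially shorthand for spelling out the ORC SerDe and input/output formats yourself:
-- Equivalent to: CREATE TABLE orc_example (id BIGINT, name STRING) STORED AS ORC;
CREATE TABLE orc_example (id BIGINT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';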
Hive uses a SerDe (Serializer/Deserializer) to do that. When you create a table you mention the file format; in your case it's ORC ("STORED AS ORC"), right? Hive internally uses the ORC library (jar file) to convert the data into a readable format. To know more about Hive internals, search for "Hive SerDe" and you will see how the data is converted to objects and vice versa.
I tried creating a Hive external table:
CREATE EXTERNAL TABLE TestXML (storexml string)
STORED AS TEXTFILE
LOCATION 'wasb:///test/';
However, when I try executing a query like the one below, it is not able to extract the fields:
SELECT
xpath_string (storexml, '/trades/trade/USI')
FROM TestXML;
I saw a post that talked about specifying the input format:
add JARS <>
set xmlinput.element=Store;
CREATE EXTERNAL TABLE EventStoreXML (storexml string)
STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'wasb:///eventstore#tradedata.blob.core.windows.net/';
I could not determine which jars to include in the add JARs statement. I am using HDInsight on Linux.
Any pointers will be appreciated.
-Madhu
Realised the issue was that the XML contained carriage returns/line breaks; with STORED AS TEXTFILE each row only held a fragment of the document, so it was not able to parse the XML.
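As a quick sanity check, on Hive 0.13 or later you can confirm the XPath expression itself against a made-up single-line document before pointing at the real data:
-- Hypothetical one-line XML literal, just to verify the XPath expression
SELECT xpath_string('<trades><trade><USI>ABC123</USI></trade></trades>',
                    '/trades/trade/USI');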