How to load data from HDFS into a Hive table - hive

Please tell me how to load data from HDFS into a Hive table, because I lost the tweets I downloaded yesterday.
To load the data I used the following:
LOAD DATA LOCAL INPATH '/user/hue/twitter/tweets/2017/03/10'
OVERWRITE INTO TABLE tweets
PARTITION (datehour=20170310);
Please give me a correct query. This is my table (I am posting it in two parts):
CREATE EXTERNAL TABLE twitter.tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweeted_status STRUCT<
text:STRING,
user:STRUCT < screen_name:STRING, name:STRING >,
retweet_count:INT
>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT <
screen_name:STRING,
name:STRING
>
>,
hashtags:ARRAY<STRUCT<text:STRING>>
>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING
>,
in_reply_to_screen_name STRING )
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/twitter';
The sample data (just a small excerpt is attached):
{"filter_level":"low","retweeted":false,"in_reply_to_screen_‌​name":null,"possibly‌​_sensitive":false,"t‌​runcated":false,"lan‌​g":"en","in_reply_to‌​_status_id_str":null‌​,"id":84064934204214‌​8865,"extended_entit‌​ies":{"media":[{"siz‌​es":{"thumb":{"w":15‌​0,"resize":"crop","h‌​":150},"small":{"w":‌​340,"resize":"fit","‌​h":340},"medium":{"w‌​":600,"resize":"fit"‌​,"h":600},"large":{"‌​w":960,"resize":"fit‌​","h":960}},"source_‌​user_id":15934076,
I found that the way to load the data from HDFS into the Hive table is:
LOAD DATA INPATH '/user/hue/twitter/tweets/2017/03/10' OVERWRITE INTO TABLE tweets PARTITION (datehour=20170310);
Is this correct, and will I lose my source file? If so, what query should I use instead?
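For reference: LOAD DATA INPATH (without LOCAL) moves the files from the given HDFS path into the table's partition directory, so the source path ends up empty after the load. Since tweets is an external, partitioned table, one option that leaves the source files exactly where they are is to register the existing directory as the partition instead. A minimal sketch, assuming the files already sit under /user/hue/twitter/tweets/2017/03/10:
ALTER TABLE twitter.tweets
ADD IF NOT EXISTS PARTITION (datehour=20170310)
LOCATION '/user/hue/twitter/tweets/2017/03/10';
After this, the data stays in its original location and is still queryable through the datehour=20170310 partition.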

Related

Athena returns more rows than are actually present in the data source in S3

I have stored a data source in S3, and when I query it in Athena and count the total number of rows, it gives me more rows than are present in the CSV file stored in S3.
I have also given a separate path for the Athena query results, i.e. different from the S3 folder path of the data source.
Please help me with this: why is Athena giving me extra rows, with unknown values in them, thus creating discrepancies in the data?
Please find below the query I wrote to create the table in Athena:
athena_client.start_query_execution(QueryString='create database cms_data',ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})
# Tables created for Athena
context = {'Database': 'cms_data'}
athena_client.start_query_execution(QueryString='''CREATE EXTERNAL TABLE IF NOT EXISTS `cms_data`.`mpf_data` (
`State` String,
`County` String,
`Org_Name` String,
`Contract_ID` String,
`Plan_ID` double,
`Segment_ID` double,
`Plan_Type_Desc` String,
`Contract_Year` double,
`Category_Name` String,
`Service_Name` String,
`Limit_Flag` double,
`Authorization_Flag` double,
`Referral_Flag` double,
`Network_Description` String,
`Cost_Share` String )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://cms-dashboard-automation/MPF_Data/'
TBLPROPERTIES ('has_encrypted_data'='false');
''',QueryExecutionContext = context,ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})
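One thing worth checking: Athena scans every object under the LOCATION prefix, so any stray files under s3://cms-dashboard-automation/MPF_Data/ (not only the intended CSV) contribute rows, and quoted CSV fields containing embedded newlines can also be split into extra records by LazySimpleSerDe. A quick diagnostic, assuming the mpf_data table created above, is to count rows per underlying file using Athena's $path pseudo-column:
SELECT "$path" AS source_file, count(*) AS row_count
FROM cms_data.mpf_data
GROUP BY "$path";
If this lists files you did not expect, or more rows per file than the CSV line count, that points to the source of the discrepancy.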

AWS Athena: line 11:1: mismatched input 'PARTITIONED'. Expecting: 'COMMENT', 'WITH', <EOF> while creating table

I am trying to create an empty table with some columns of specific data types in S3, using a command launched in Athena, but it is throwing me the following error:
line 11:1: mismatched input 'PARTITIONED'. Expecting: 'COMMENT', 'WITH', <EOF>
The query I'm executing is the following:
CREATE TABLE IF NOT EXISTS boards_raw_fields_v1 (
"uuid" bigint,
"source" string,
"raw_company_name" string,
"raw_contract_type" string,
"raw_employment_type" string,
"raw_working_hours_type" string,
"raw_all_locations" array<string>,
"raw_categories" array<string>,
"raw_industry" string)
PARTITIONED BY (
"year" string,
"month" string,
"day" string,
"hour" string,
"version" bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://my-data/datalake-raw/boards/raw-fields/v1'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='snappy',
'projection.enabled'='false',
'typeOfData'='file')
In the URI:
s3://my-data/datalake-raw/boards/raw-fields/v1
IMPORTANT NOTE: All the folders currently exist except the last one, v1.
What am I doing wrong here in the process of creating the table?
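For context, Athena only accepts PARTITIONED BY in Hive-style CREATE EXTERNAL TABLE DDL; a plain CREATE TABLE is parsed as a CTAS-style statement, whose grammar expects COMMENT, WITH or end of input after the column list, which is exactly the error shown. A sketch of the statement rewritten as an external table (note that Hive DDL uses backticks rather than double quotes for identifiers, and STORED AS PARQUET is shorthand for the explicit Parquet SerDe/input/output formats):
CREATE EXTERNAL TABLE IF NOT EXISTS boards_raw_fields_v1 (
`uuid` bigint,
`source` string,
`raw_company_name` string,
`raw_contract_type` string,
`raw_employment_type` string,
`raw_working_hours_type` string,
`raw_all_locations` array<string>,
`raw_categories` array<string>,
`raw_industry` string)
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string,
`version` bigint)
STORED AS PARQUET
LOCATION 's3://my-data/datalake-raw/boards/raw-fields/v1/'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='snappy',
'projection.enabled'='false',
'typeOfData'='file');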

Can't get Hive SerDe to work - returns 0 records

This is my second attempt at using a SerDe. The first one worked quite well, but now I'm really struggling.
I have an XML file with this structure:
This is the Hive table I created:
CREATE TABLE raw_abc.text_abc
(
publicationid string,
parentid string,
id string,
level string,
usertypeid string,
name string,
assetcrossreferences_ordered string,
assetcrossreferences MAP<string, string>,
attributenames_ordered string,
attributenames map<string,string>,
seo_ordered string,
seo MAP<string, string>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.publicationid"="/ST:ECC-HierarchyMessage/#PublicationID",
"column.xpath.parentid"="/ST:ECC-HierarchyMessage/Product/#ParentID",
"column.xpath.id"="/ST:ECC-HierarchyMessage/Product/#ID",
"column.xpath.level"="/ST:ECC-HierarchyMessage/Product/#Level",
"column.xpath.usertypeid"="/ST:ECC-HierarchyMessage/Product/#UserTypeID",
"column.xpath.name"="/ST:ECC-HierarchyMessage/Product/#Name",
"column.xpath.assetcrossreferences_ordered"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/#Ordered",
"column.xpath.assetcrossreferences"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/AssetCrossReference",
"column.xpath.attributenames_ordered"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.attributenames"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.seo_ordered"="/ST:ECC-HierarchyMessage/Product/SEO/#Ordered",
"column.xpath.seo"="/ST:ECC-HierarchyMessage/Product/SEO"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location 's3a://ec-abc-dev/inbound/abc/abc/'
TBLPROPERTIES (
"xmlinput.start"="<ST:ECC-HierarchyMessage>",
"xmlinput.end"="</ST:ECC-HierarchyMessage>"
)
;
The table is created successfully; however, when I try select * from raw_abc.text_abc, I get no records back.
Any idea what's wrong here? I've spent the last two days trying to figure it out with no luck.
Thanks,
G
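A note on a common cause of this symptom: the XmlInputFormat looks for the xmlinput.start string byte-for-byte in the file, so if the root element carries attributes or namespace declarations (for example <ST:ECC-HierarchyMessage xmlns:ST="...">), the literal string <ST:ECC-HierarchyMessage> never appears and no records are emitted. A possible fix, assuming that is the case here, is to drop the closing bracket from the start tag so the prefix still matches, either by recreating the table or by altering it in place:
ALTER TABLE raw_abc.text_abc SET TBLPROPERTIES (
"xmlinput.start" = "<ST:ECC-HierarchyMessage"
);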

Store multiple elements in json files in AWS Athena

I have some JSON files stored in an S3 bucket, where each file has multiple elements of the same structure. For example:
[{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}},{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}},{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}]
I want to create a table in Athena corresponding to the above data.
The query I wrote for creating the table:
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.elb_logs2 (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'field.delim' = ' '
) LOCATION 's3://<bucketname>/';
But if I do a SELECT query as follows,
SELECT * FROM sampledb.elb_logs4;
I get the following result:
1 {"eventid":"1","eventversion":"1.0","image":{"id":"101","message":"New item!"},"eventsource":"aws:dynamodb","eventname":"INSERT","awsregion":"us-west-2"} {"eventid":"2","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"MODIFY","awsregion":"us-west-2"} {"eventid":"3","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"REMOVE","awsregion":"us-west-2"}
The entire content of the JSON file is picked up as one entry here.
How can I read each element of the JSON file as a separate entry?
Edit: How can I read each subcolumn of image, i.e., each element of the map?
Thanks.
Question 1: Store multiple elements in JSON files for AWS Athena
I need to rewrite my JSON file as
{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}, {"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}, {"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
That means: remove the square brackets [ ] and keep each element on one line:
{.....................}
{.....................}
{.....................}
Question 2: Access nested JSON attributes
CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` struct <`Id` : string,
`Message` : string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
"dots.in.keys" = "true"
) LOCATION 's3://exampletablewithstream-us-west-2/';
Query:
select image.Id, image.message from <tablename>;
Ref:
http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/
https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords

Hive - dynamic partitions: Long loading times with a lot of partitions when updating table

I run Hive via AWS EMR and have a jobflow that parses log data frequently into S3. I use dynamic partitions (date and log level) for my parsed Hive table.
One thing that now takes forever, when I have several gigabytes of data and a lot of partitions, is the step where Hive loads the data into the table after the parsing is done.
Loading data to table default.logs partition (dt=null, level=null)
...
Loading partition {dt=2013-08-06, level=INFO}
Loading partition {dt=2013-03-12, level=ERROR}
Loading partition {dt=2013-08-03, level=WARN}
Loading partition {dt=2013-07-08, level=INFO}
Loading partition {dt=2013-08-03, level=ERROR}
...
Partition default.logs{dt=2013-03-05, level=INFO} stats: [num_files: 1, num_rows: 0, total_size: 1905, raw_data_size: 0]
Partition default.logs{dt=2013-03-06, level=ERROR} stats: [num_files: 1, num_rows: 0, total_size: 4338, raw_data_size: 0]
Partition default.logs{dt=2013-03-06, level=INFO} stats: [num_files: 1, num_rows: 0, total_size: 828250, raw_data_size: 0]
...
Partition default.logs{dt=2013-08-14, level=INFO} stats: [num_files: 5, num_rows: 0, total_size: 626629, raw_data_size: 0]
Partition default.logs{dt=2013-08-14, level=WARN} stats: [num_files: 4, num_rows: 0, total_size: 4405, raw_data_size: 0]
Is there a way to overcome this problem and reduce the loading times for this step?
I have already tried archiving old logs to Glacier via a bucket lifecycle rule in the hope that Hive would skip loading the archived partitions. However, since this still keeps the file paths visible in S3, Hive recognizes the archived partitions anyway, so no performance is gained.
Update 1
The loading of the data is done by simply inserting the data into the dynamically partitioned table:
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, dt, level
FROM new_logs ;
from one table that contains the unparsed logs:
CREATE EXTERNAL TABLE new_logs (
dt STRING,
time STRING,
thread STRING,
level STRING,
logger STRING,
identity STRING,
message STRING,
logtype STRING,
logsubtype STRING,
node STRING,
storageallocationstatus STRING,
nodelist STRING,
userid STRING,
nodeid STRING,
path STRING,
datablockid STRING,
hash STRING,
size STRING,
value STRING,
exception STRING,
version STRING
)
PARTITIONED BY (
server STRING,
app STRING
)
ROW FORMAT
DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'org.maz.hadoop.mapred.LogFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://my-log/logs/${LOCATION}' ;
into the new (parsed) table:
CREATE EXTERNAL TABLE logs (
time STRING,
thread STRING,
logger STRING,
identity STRING,
message STRING,
logtype STRING,
logsubtype STRING,
node STRING,
storageallocationstatus STRING,
nodelist STRING,
userid STRING,
nodeid STRING,
path STRING,
datablockid STRING,
hash STRING,
size STRING,
exception STRING,
value STRING,
server STRING,
app STRING,
version STRING
)
PARTITIONED BY (
dt STRING,
level STRING
)
ROW FORMAT
DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://my-log/parsed-logs' ;
The input format (LogFileInputFormat) is responsible for parsing log entries into the desired log format.
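As a side note, dynamic partition inserts like the one in Update 1 usually also need the dynamic-partitioning session settings enabled (a hedged aside; the exact defaults depend on the Hive version in use):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;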
Update 2
When I try the following
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, dt, level
FROM new_logs
WHERE dt > 'some old date';
Hive still loads all partitions of logs. If, on the other hand, I use static partitioning like
INSERT INTO TABLE logs PARTITION (dt='some date', level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, level
FROM new_logs
WHERE dt = 'some date';
Hive only loads the partitions concerned, but then I need to create one query for each date I think might be present in new_logs. Usually new_logs only contains log entries from today and yesterday, but it might contain older entries as well.
Static partitioning is my solution of choice at the moment, but aren't there any other (better) solutions to my problem?
During this slow phase, Hive takes the files it built for each partition and moves them from a temporary directory to a permanent directory. You can see this in the EXPLAIN EXTENDED output as a Move Operator.
So for each partition it's one move plus an update to the metastore. I don't use EMR, but I presume this act of moving files to S3 has high latency for each file it needs to move.
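To see that Move Operator yourself, you can prefix the insert from Update 1 with EXPLAIN EXTENDED, for example:
EXPLAIN EXTENDED
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, dt, level
FROM new_logs;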
What's not clear from what you wrote is whether you're doing a full load each time you run. For example, why do you have a 2013-03-05 partition? Are you getting new log data that contains this old date? If this data is already in your logs table, you should modify your insert statement like:
SELECT fields
FROM new_logs
WHERE dt > 'date of last run';
This way you'll only get a few partitions and only a few files to move. It's still wasteful to scan all this extra data from new_logs, but you can solve that by partitioning new_logs as well (a sketch follows below).
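A rough sketch of what a date-partitioned staging table could look like (hypothetical table name new_logs_by_date and location; this assumes the upstream job can write each day's raw files under its own server/app/dt prefix, so the WHERE dt filter prunes whole partitions instead of scanning everything):
CREATE EXTERNAL TABLE new_logs_by_date (
time STRING,
thread STRING,
level STRING,
logger STRING,
identity STRING,
message STRING,
logtype STRING,
logsubtype STRING,
node STRING,
storageallocationstatus STRING,
nodelist STRING,
userid STRING,
nodeid STRING,
path STRING,
datablockid STRING,
hash STRING,
size STRING,
value STRING,
exception STRING,
version STRING
)
PARTITIONED BY (
server STRING,
app STRING,
dt STRING
)
ROW FORMAT
DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'org.maz.hadoop.mapred.LogFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://my-log/logs-by-date/';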
AWS has improved Hive partition recovery time by more than an order of magnitude on EMR 3.2.x and above.
We have a Hive table with more than 20,000 partitions on S3. With prior versions of EMR it used to take ~80 minutes to recover, and now with 3.2.x/3.3.x we are able to do it in under 5 minutes.
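For reference, "partition recovery" here means re-registering the partition directories found on S3 in the metastore. On EMR Hive this is typically done with the EMR-specific RECOVER PARTITIONS statement (stock Hive uses MSCK REPAIR TABLE), for example against the logs table above:
ALTER TABLE logs RECOVER PARTITIONS;
-- equivalent in stock Hive:
MSCK REPAIR TABLE logs;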