I am very new here. I am trying to run the following code on my
Cloudera QuickStart VM:
CREATE TABLE apache_common_log (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
STORED AS TEXTFILE;
but I got this error:
failed: execution error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask, Cannot validate serde: org.apache.hadoop.hive.serde2.RegexSerde
I did some research: all the fields are STRING, and I have added these jars:
/usr/lib/hive/lib/hive-contrib.jar
/usr/lib/hive/lib/hive-serde.jar
/usr/lib/hive/lib/hive-common.jar
It still didn't work.
I really need some help; any input will be appreciated!
I am trying to create an empty table in S3, with columns of fixed data types, using a command run in Athena, but it throws the following error:
line 11:1: mismatched input 'PARTITIONED'. Expecting: 'COMMENT', 'WITH', <EOF>
The query I'm executing is the following:
CREATE TABLE IF NOT EXISTS boards_raw_fields_v1 (
"uuid" bigint,
"source" string,
"raw_company_name" string,
"raw_contract_type" string,
"raw_employment_type" string,
"raw_working_hours_type" string,
"raw_all_locations" array<string>,
"raw_categories" array<string>,
"raw_industry" string)
PARTITIONED BY (
"year" string,
"month" string,
"day" string,
"hour" string,
"version" bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://my-data/datalake-raw/boards/raw-fields/v1'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='snappy',
'projection.enabled'='false',
'typeOfData'='file')
In the URI:
s3://my-data/datalake-raw/boards/raw-fields/v1
IMPORTANT NOTE: all the folders in this URI already exist except the last one, v1.
What am I doing wrong here in the process of creating the table?
This is my second attempt at using a SerDe. The first one worked quite well, but now I'm really struggling.
I have an XML file with this structure:
This is the Hive table I created:
CREATE TABLE raw_abc.text_abc
(
publicationid string,
parentid string,
id string,
level string,
usertypeid string,
name string,
assetcrossreferences_ordered string,
assetcrossreferences MAP<string, string>,
attributenames_ordered string,
attributenames map<string,string>,
seo_ordered string,
seo MAP<string, string>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.publicationid"="/ST:ECC-HierarchyMessage/#PublicationID",
"column.xpath.parentid"="/ST:ECC-HierarchyMessage/Product/#ParentID",
"column.xpath.id"="/ST:ECC-HierarchyMessage/Product/#ID",
"column.xpath.level"="/ST:ECC-HierarchyMessage/Product/#Level",
"column.xpath.usertypeid"="/ST:ECC-HierarchyMessage/Product/#UserTypeID",
"column.xpath.name"="/ST:ECC-HierarchyMessage/Product/#Name",
"column.xpath.assetcrossreferences_ordered"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/#Ordered",
"column.xpath.assetcrossreferences"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/AssetCrossReference",
"column.xpath.attributenames_ordered"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.attributenames"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.seo_ordered"="/ST:ECC-HierarchyMessage/Product/SEO/#Ordered",
"column.xpath.seo"="/ST:ECC-HierarchyMessage/Product/SEO"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location 's3a://ec-abc-dev/inbound/abc/abc/'
TBLPROPERTIES (
"xmlinput.start"="<ST:ECC-HierarchyMessage>",
"xmlinput.end"="</ST:ECC-HierarchyMessage>"
)
;
The table is created successfully. However,
when I run select * from raw_abc.text_abc, I get no records in return.
Any idea what's wrong here? I've spent the last 2 days trying to figure it out with no luck.
Thanks,
G
I am running the query below in Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs (
request_timestamp string,
elb_name string,
request_ip string,
request_port int,
backend_ip string,
backend_port int,
request_processing_time double,
backend_processing_time double,
client_response_time double,
elb_response_code string,
backend_response_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/';
This query requires the S3 location in the following form:
s3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/
What is the prefix in this path? The S3 location of my logs is actually
s3://your_log_bucket/AWSLogs/AWS_account_ID/elasticloadbalancing/
Am I missing something?
If your logs are at s3://your_log_bucket/AWSLogs/AWS_account_ID/elasticloadbalancing/, you don't need a prefix; simply use that S3 path as the Athena table's LOCATION.
FYI, if several APIs' load balancers write logs to the same S3 bucket, each gets its own path, such as s3://your_log_bucket/api-v1, s3://your_log_bucket/api-v2, and so on. Here the prefix is api-v1, and the S3 location would be s3://your_log_bucket/api-v1/AWSLogs/AWS_account_ID/elasticloadbalancing/
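The rule above can be written down as a small helper. This is only an illustrative Python sketch: the function name elb_log_location is made up for the example and is not part of any AWS API, and the bucket, prefix, and account ID values are the placeholders used in this thread.

```python
def elb_log_location(bucket, account_id, prefix=None):
    """Build the S3 LOCATION for ELB logs, with or without an optional prefix."""
    parts = [bucket]
    if prefix:
        # Strip stray slashes so "api-v1/" and "api-v1" give the same path
        parts.append(prefix.strip("/"))
    parts += ["AWSLogs", account_id, "elasticloadbalancing"]
    return "s3://" + "/".join(parts) + "/"


# No prefix configured: the path starts directly with AWSLogs
print(elb_log_location("your_log_bucket", "AWS_account_ID"))
# Prefix configured: it sits between the bucket name and AWSLogs
print(elb_log_location("your_log_bucket", "AWS_account_ID", prefix="api-v1"))
```

Whichever string matches where the log files actually sit in the bucket is the one to put in the table's LOCATION clause.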
Table description:
hive> desc log23;
OK
col_name data_type comment
17/05/25 10:49:12 INFO mapred.FileInputFormat: Total input files to process : 1
host string from deserializer
remote_host string from deserializer
remote_logname string from deserializer
remote_user string from deserializer
request_time string from deserializer
request_method string from deserializer
request_url string from deserializer
first_line string from deserializer
http_status string from deserializer
bytes string from deserializer
referer string from deserializer
agent string from deserializer
Time taken: 0.049 seconds, Fetched: 12 row(s)
SerDe settings for the Apache log format:
serializationLib:org.apache.hadoop.hive.contrib.serde2.RegexSerDe, parameters:{output.format.string=%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s, serialization.format=1, input.regex=([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) "(.[A-Z]*) (.*) (.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"})
Adding a column with ALTER TABLE:
hive> alter table log23 add columns (code string);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Error: type expected at the position 0 of derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:<derived from deserializer:derived from deserializer:string but>'<' is found.
I get the error shown above. How do I add a column?
Unfortunately, you cannot add columns to a table that was created with a SerDe. It is a known issue:
https://issues.apache.org/jira/browse/HIVE-17713
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns. This is supported for Avro backed tables as well, for Hive 0.14 and later.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe). Refer to Hive SerDe for more information. REPLACE COLUMNS can also be used to drop columns. For example, "ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" will remove column 'c' from test_change's schema.
I tried the same thing, but I was able to create a table and add a column at the end:
create table log23 (host String, remote_host String);
alter table log23 add columns(code String);
This works with the TEXTFILE format. Please let me know if you are using a different file format so that I can try to reproduce the issue.
Aim: to parse and load log data into Hive using the SerDe feature. We are facing an issue while retrieving the data with a SELECT statement.
We created a table and were able to load data successfully. However, the SELECT statement returns only NULL values.
Sample log data:
2013-02-21 00:13:48,916 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_5729677439273359430_1495
The regex we came up with to parse the above log is:
([^ ]*) ([^ ]{8})[^ ]* ([A-Z]*) ([^ ]*): ([[^ ]*\s]*)
Create Table
CREATE EXTERNAL TABLE log (
dt STRING,
time STRING,
loglevel STRING,
check STRING,
status STRING )
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex"([^ ]*) ([^ ]{8})[^ ]* ([A-Z]*) ([^ ]*): ([[^ ]*\s]*)",
"output.format.string"="%1$s %2$s %3$s %4$s %5$s")
STORED AS TEXTFILE LOCATION '/tmp/log/';
We added the jar:
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u4.jar;
Load data:
load data local inpath "/tmp/logdata.txt" into table log;
Retrieve data:
Select * from log LIMIT 1;
Output:
NULL NULL NULL NULL NULL
Sample Log data:
2013-02-21 00:13:48,916 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner:
Verification succeeded for blk_5729677439273359430_1495
2013-02-21 00:15:39,929 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner:
Verification succeeded for blk_-4787916211671845946_1464
Thanks in Advance!!
Please try this regex (Rubular link):
([^ ]*) ([^ ]{8})[^ ]* ([A-Z]*) ([^ ]*): (.*)
It looks like you need to add "=" after "input.regex".
More generally, this kind of error usually occurs when the regular expression doesn't FULLY match the input line.
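The full-match point can be checked offline before creating the table. The sketch below uses Python's re.fullmatch as a rough stand-in for the whole-line match that RegexSerDe performs with Java's regex engine. The two engines agree on this simple pattern but are not identical in general, so treat it as a sanity check rather than proof. It assumes each record is a single line, as in the first sample; the names PATTERN and parse_line are illustrative.

```python
import re

# Pattern suggested in the answer above; the last group is relaxed to (.*)
PATTERN = re.compile(r"([^ ]*) ([^ ]{8})[^ ]* ([A-Z]*) ([^ ]*): (.*)")


def parse_line(line):
    """Return the captured groups, or None when the whole line does not match."""
    m = PATTERN.fullmatch(line)
    return m.groups() if m else None


sample = ("2013-02-21 00:13:48,916 INFO "
          "org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: "
          "Verification succeeded for blk_5729677439273359430_1495")

# Five groups, one per table column: dt, time, loglevel, check, status
fields = parse_line(sample)
print(fields)

# A line cut off after the class name has no ": " separator, so the pattern
# cannot cover the whole line and nothing is captured
truncated = sample.split(": ")[0] + ":"
print(parse_line(truncated))
```

If a line in the data file returns None here, the matching row would be expected to come back as all NULLs, which is the symptom described in the question.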