Aim: to parse and load log data into Hive using the SerDe feature. We are facing an issue while retrieving the data using a SELECT statement.
We created a table and are able to load data successfully. However, the SELECT statement retrieves only NULL values.
Sample log data:
2013-02-21 00:13:48,916 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_5729677439273359430_1495
The regex we came up with to parse the above log is:
([^ ]*) ([^ ]{8})[^ ]* ([A-Z]*) ([^ ]*): ([[^ ]*\s]*)
Create table:
CREATE EXTERNAL TABLE log (
dt STRING,
time STRING,
loglevel STRING,
check STRING,
status STRING )
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex"([^ ]*) ([^ ]{8})[^ ]* ([A-Z]*) ([^ ]*): ([[^ ]*\s]*)",
"output.format.string"="%1$s %2$s %3$s %4$s %5$s")
STORED AS TEXTFILE LOCATION '/tmp/log/';
We added the jar:
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u4.jar;
Load data:
load data local inpath "/tmp/logdata.txt" into table log;
Retrieve data:
SELECT * FROM log LIMIT 1;
Output:
NULL NULL NULL NULL NULL
Sample log data:
2013-02-21 00:13:48,916 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_5729677439273359430_1495
2013-02-21 00:15:39,929 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-4787916211671845946_1464
Thanks in Advance!!
Please try this, Rubular link:
([^ ]*) ([^ ]{8})[^ ]* ([A-Z]*) ([^ ]*): (.*)
Looks like you should add "=" following "input.regex".
Also, this kind of error is usually caused by the regular expression not FULLY matching the input.
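For reference, here is a sketch of the corrected CREATE TABLE, combining the regex above with the missing "=" (everything else left as in your statement):
CREATE EXTERNAL TABLE log (
dt STRING,
time STRING,
loglevel STRING,
check STRING,
status STRING )
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]{8})[^ ]* ([A-Z]*) ([^ ]*): (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s")
STORED AS TEXTFILE LOCATION '/tmp/log/';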
Related
I have exported CloudWatch logs to S3 and now want to import those logs to Athena. The format of the logs is as follows (pasted only one log for reference):
2021-07-30T14:30:22.937Z RequestId INFO {"_logLevel":"debug","msg":"Start: Calling All the Data Associates Function","timestamp":1627655422937,"EventSubCategory":"AppSyncService","API":"AppSyncService","function":"XXXXXXXXXXXXXXXXX","Correlation_Id":"XXXXXXXXXXXXXXXXX"}
I am using a regular expression to import the log and using the following query to create the table.
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n'
)
LOCATION 's3://logs/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
Regular Expression:
^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n
There are four columns in the table, and the regular expression also creates four groups and works as per my expectation. However, we still get an empty table as the result.
Can anyone please help resolve this issue?
I think your problem is that you need to double-escape things in the regex, and you should also match $ at the end rather than a newline. Try this pattern:
'input.regex' = '^(.*?)\\t(.*?)\\t(.*?)\\t([\\s\\S]*?)$'
You can see an example in the official docs.
Also, the pattern [\s\S] could be replaced by . (\S means everything not matched by \s, so together they match anything).
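Putting both fixes together, a sketch of the corrected DDL (assuming the rest of your statement stays as posted, and using the simpler (.*) for the last group):
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(.*?)\\t(.*?)\\t(.*?)\\t(.*)$'
)
LOCATION 's3://logs/test/'
TBLPROPERTIES ('has_encrypted_data'='false');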
An alternative to the regex serde is Grok, which is less error-prone to write. Using the Grok serde, I think this table would work for you:
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format' = '%{TIMESTAMP_ISO8601:time} %{NOTSPACE:requestid} %{NOTSPACE:loglevel} %{GREEDYDATA:message}'
)
LOCATION 's3://logs/test/'
Grok patterns are much easier to read. Check out the documentation and the built-in patterns for more info.
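Either way, a quick sanity check after creating the table; this hypothetical query should return parsed rows rather than an empty result:
SELECT "time", requestid, loglevel, message
FROM test1
LIMIT 10;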
I am trying to define a table that has a column that is an array of structs, using standard SQL. The docs here suggest this should work:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<INT64,INT64>>
)
but I get an error:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat xxxx.ddl.temp.sql | awk 'ORS=" "')"
Waiting on bqjob_r6735048b_00000173ed2d9645_1 ... (0s) Current status: DONE
Error in query string: Error processing job 'xxxxx-10843454-yyyyy-
dev:bqjob_r6735048b_00000173ed2d9645_1': Illegal field name:
Changing the field (edit: column!) name does not fix it. What am I doing wrong?
The fields within the struct need to be named, so this works:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<x INT64,y INT64>>
)
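Once the struct fields are named, here is a hypothetical sketch (the values are invented) showing how the array of structs can be populated and queried:
INSERT INTO ta_producer_conformed.FundStaticData (id, something)
VALUES ('fund-1', [STRUCT(1 AS x, 2 AS y), STRUCT(3 AS x, 4 AS y)]);

SELECT id, s.x, s.y
FROM ta_producer_conformed.FundStaticData, UNNEST(something) AS s;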
I am running the below query in Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs (
request_timestamp string,
elb_name string,
request_ip string,
request_port int,
backend_ip string,
backend_port int,
request_processing_time double,
backend_processing_time double,
client_response_time double,
elb_response_code string,
backend_response_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/';
In this query we need to specify the S3 location as follows:
s3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/
What is the prefix mentioned in this path? The actual S3 location for the logs is s3://your_log_bucket/AWSLogs/AWS_account_ID/elasticloadbalancing/
Am I missing something?
If your logs' location is s3://your_log_bucket/AWSLogs/AWS_account_ID/elasticloadbalancing/, then you don't need a prefix value; simply use that S3 location as the Athena table's location.
FYI, if multiple APIs' load balancers are writing log data to the same S3 bucket, there would be different S3 paths such as s3://your_log_bucket/api-v1, s3://your_log_bucket/api-v2, etc. Here the prefix is api-v1, and the S3 location would be s3://your_log_bucket/api-v1/AWSLogs/AWS_account_ID/elasticloadbalancing/
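In that case the table's LOCATION clause would point at the prefixed path, for example:
LOCATION 's3://your_log_bucket/api-v1/AWSLogs/AWS_account_ID/elasticloadbalancing/';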
table desc info
hive> desc log23;
OK
col_name data_type comment
17/05/25 10:49:12 INFO mapred.FileInputFormat: Total input files to process : 1
host string from deserializer
remote_host string from deserializer
remote_logname string from deserializer
remote_user string from deserializer
request_time string from deserializer
request_method string from deserializer
request_url string from deserializer
first_line string from deserializer
http_status string from deserializer
bytes string from deserializer
referer string from deserializer
agent string from deserializer
Time taken: 0.049 seconds, Fetched: 12 row(s)
Apache log format serde info:
serializationLib:org.apache.hadoop.hive.contrib.serde2.RegexSerDe, parameters:{output.format.string=%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s, serialization.format=1, input.regex=([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) "(.[A-Z]*) (.*) (.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"})
Add a column using Alter query
hive> alter table log23 add columns (code string);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Error: type expected at the position 0 of derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:<derived from deserializer:derived from deserializer:string but>'<' is found.
I get the error above. How do I add a column?
Unfortunately, you cannot add columns if you've used a SerDe. It is a known issue:
https://issues.apache.org/jira/browse/HIVE-17713
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns. This is supported for Avro backed tables as well, for Hive 0.14 and later.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe). Refer to Hive SerDe for more information. REPLACE COLUMNS can also be used to drop columns. For example, "ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" will remove column 'c' from test_change's schema.
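If the table is EXTERNAL (so its data files outlive the table definition), a common workaround is to drop and recreate it with the extra column and a regex that has a matching capture group. A hedged sketch based on the desc output above; the trailing optional group and the LOCATION path are assumptions about your setup:
-- DROP on an EXTERNAL table removes only metadata; the files under LOCATION remain.
DROP TABLE log23;

CREATE EXTERNAL TABLE log23 (
  host STRING,
  remote_host STRING,
  remote_logname STRING,
  remote_user STRING,
  request_time STRING,
  request_method STRING,
  request_url STRING,
  first_line STRING,
  http_status STRING,
  bytes STRING,
  referer STRING,
  agent STRING,
  code STRING  -- the new column
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- the original 12 capture groups plus an optional 13th for the new column
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) \"(.[A-Z]*) (.*) (.*)\" (-|[0-9]*) (-|[0-9]*) \"(.*)\" \"(.*)\"(?: (.*))?"
)
LOCATION '/path/to/log23';  -- hypothetical: use the table's existing location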
I tried the same, but I am able to create a table and add the columns at the end:
create table log23 (host String, remote_host String);
alter table log23 add columns(code String);
This is working with the textfile format. Please let me know if you are using a different file format so that I can try to replicate the issue.
I am very new here. I am trying to run the following code on my Cloudera QuickStart VM.
CREATE TABLE apache_common_log (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"
[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
STORED AS TEXTFILE;
but I got an error:
failed: execution error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask, Cannot validate serde: org.apache.hadoop.hive.serde2.RegexSerde
I did some research; all the fields are STRING, and I have added the jars:
/usr/lib/hive/lib/hive-contrib.jar
/usr/lib/hive/lib/hive-serde.jar
/usr/lib/hive/lib/hive-common.jar
It still didn't work.
I really need some help! Any input will be appreciated!