Hive with regexserde property doesn't work properly - hive

I used regex101 website to validate my regex:
([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)" "(.*?)" "(.*?)"
It works fine for the log below
66.240.70.141 - - [01/Mar/2018:06:16:46 +0000] "GET /example.download.handler.com/products/01/00/item/116314/8/002394857_2BB.jpg HTTP/1.1" 200 41710 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB30P) AppleWebKit/536.37 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-" "C0T1_19610|3881001|"
But the same expression doesn't work on hive:
CREATE EXTERNAL TABLE `web_logs_test`(
`ip_address` string COMMENT '',
`date_string` string COMMENT '',
`request` string COMMENT '',
`status` string COMMENT '',
`bytes` string COMMENT '',
`referer` string COMMENT '',
`user_agent` string COMMENT '',
`cookie` string COMMENT ''
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)" "(.*?)" "(.*?)"'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/weblogs/data'
If anyone knows, kindly help me out.
Thanks in advance.

CREATE EXTERNAL TABLE web_logs (
ip_address STRING,
date_string STRING,
request STRING,
status STRING,
bytes STRING,
referer STRING,
user_agent STRING,
cookie STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([\\d.]+) \\S+ \\S+ \\[(.+?)\\] \\\"(.+?)\\\" (\\d{3}) (\\d+) \\\"(.+?)\\\" \\\"(.+?)\\\" \\\"SESSIONID=(\\d+)\\\"\\s*"
)
LOCATION '/file_location/web_logs';

Related

Create JSON from XML - JSON_AGG OUTPUT PROBLEM

I have a problem with converting XML content to JSON format (with plain oracle select statement), where more then 1 sub level of data is present in the original XML - with my code the result of level 2+ is presented as string and not as JSON_OBJECT. Please, could someone tell me, where is fault in my code or what I'm doing wrong:
source:
<envelope>
<sender>
<name>IZS</name>
<country>SU</country>
<address>LOCATION 10B</address>
<address>1000 CITY</address>
<sender_identifier>SU46794093</sender_identifier>
<sender_address>
<sender_agent>SKWWSI20XXX</sender_agent>
<sender_mailbox>SI56031098765414228</sender_mailbox>
</sender_address>
</sender>
</envelope>
transformation select statement:
WITH SAMPLE AS (SELECT XMLTYPE ('
<envelope>
<sender>
<name>IZS</name>
<country>SU</country>
<address>LOCATION 10B</address>
<address>1000 CITY</address>
<sender_identifier>SU46794093</sender_identifier>
<sender_address>
<sender_agent>SKWWSI20XXX</sender_agent>
<sender_mailbox>SI56031098765414228</sender_mailbox>
</sender_address>
</sender>
</envelope>') XMLDOC FROM DUAL)
SELECT JSON_SERIALIZE (
JSON_OBJECT (
KEY 'envelope' VALUE
JSON_OBJECTAGG (
KEY ID_LEVEL1 VALUE
CASE ID_LEVEL1
WHEN 'sender' THEN
( SELECT JSON_OBJECTAGG (
KEY ID_LEVEL2 VALUE
CASE ID_LEVEL2
WHEN 'sender_address' THEN
( SELECT JSON_OBJECTagg (KEY ID_LEVEL22 VALUE TEXT_LEVEL22)
FROM XMLTABLE ('/sender/sender_address/*'
PASSING XML_LEVEL2
COLUMNS ID_LEVEL22 VARCHAR2 (128) PATH './name()',
TEXT_LEVEL22 VARCHAR2 (128) PATH './text()'
)
)
ELSE
TEXT_LEVEL2
END)
FROM XMLTABLE ('/sender/*'
PASSING XML_LEVEL2
COLUMNS ID_LEVEL2 VARCHAR2 (1024) PATH './name()',
TEXT_LEVEL2 VARCHAR2 (1024) PATH './text()'
)
)
ELSE
'"' || TEXT_LEVEL1 || '"'
END FORMAT JSON)
) PRETTY
)JSON_DOC
FROM SAMPLE, XMLTABLE ('/envelope/*'
PASSING XMLDOC
COLUMNS ID_LEVEL1 VARCHAR2 (1024) PATH './name()',
TEXT_LEVEL1 VARCHAR2 (1024) PATH './text()',
XML_LEVEL2 XMLTYPE PATH '.'
);
wrong result:
{
"envelope" :
{
"sender" :
{
"name" : "IZS",
"country" : "SU",
"address" : "LOCATION 10B",
"address" : "1000 CITY",
"sender_identifier" : "SU46794093",
"sender_address" : "{\"sender_agent\":\"SKWWSI20XXX\",\"sender_mailbox\":\"SI56031098765414228\"}"
}
}
}
wrong part:
***"sender_address" : "{\"sender_agent\":\"SKWWSI20XXX\",\"sender_mailbox\":\"SI56031098765414228\"}"***
For the level 1 text you're wrapping the value in double-quotes and specifying format json; you aren't doing that for level 2. If you change:
ELSE
TEXT_LEVEL2
END
to:
ELSE
'"' || TEXT_LEVEL2 || '"'
END FORMAT JSON)
then the result is:
{
  "envelope" :
  {
    "sender" :
    {
      "name" : "IZS",
      "country" : "SU",
      "address" : "LOCATION 10B",
      "address" : "1000 CITY",
      "sender_identifier" : "SU46794093",
      "sender_address" :
      {
        "sender_agent" : "SKWWSI20XXX",
        "sender_mailbox" : "SI56031098765414228"
      }
    }
  }
}
fiddle
The problem is that you need kind of conditional "FORMAT JSON" in the "SELECT JSON_OBJECTAGG ( KEY ID_LEVEL2 VALUECASE ID_LEVEL2": when the ID_LEVEL2 is 'sender_address' but not in the ELSE part, but the syntax requires you put after the END of CASE, and of course this fails for the "ELSE TEXT_LEVEL2" part.

How can I insert data into this structure in BigQuery?

In BigQuery, I want to insert some data into this very simple data structure:
Field Type Mode
id STRING NULLABLE
policies RECORD REPEATED
s RECORD NULLABLE
something STRING NULLABLE
riskTypes RECORD REPEATED
code STRING NULLABLE
In the light of my previous question I would expect the syntax to be as follows:
UPDATE `tablename` SET policies = ARRAY_CONCAT(
policies, [
struct<s struct<something STRING, riskTypes ARRAY<struct<code STRING>>>
("example something", [("example description")])
]
)
WHERE id = 'Moose';
But this gives an error:
Unexpected "[" (before the "example description")
Below should work
update `tablename` set policies = policies || [
struct<s struct<something string, riskTypes array<struct<code string>> >>
(struct('example something' as something, [struct('example description 1' as code), struct('example description 2')]))
]
where id = 'Moose'

How to update XML attribute in clob Oracle using XMLQuery

Oracle table name: SR_DATA;
Table field name: XMLDATA type CLOB;
Field value:
<module xmlns="http://www.mytest.com/2008/FMSchema">
<tmEsObjective modelCodeScheme="A" modelCodeSchemeVersion="01" modelCodeValue="ES_A"></tmEsObjective>
</module>
I need to update the value of the attribute "modelCodeValue" into ES_B.
This is the code:
UPDATE SR_DATA
SET XMLDATA =
XMLQuery('copy $i := $p1 modify
((for $j in $i/module/tmEsObjective/#modelCodeValue
return replace value of node $j with $p2))
)
return $i'
PASSING XMLType(REPLACE(xmldata, 'xmlns="http://www.mytest.com/2008/FMSchema"', '')) AS "p1",
'ES_B' AS "p2"
RETURNING CONTENT);
This code returns the error code: ORA-00932: inconsistent datatypes: expected CLOB got -
Use getclobval() like this:
UPDATE SR_DATA
SET XMLDATA =
XMLTYPE.GETCLOBVAL(XMLQuery('copy $i := $p1 modify
((for $j in $i/module/tmEsObjective/#modelCodeValue
return replace value of node $j with $p2))
return $i'
PASSING XMLType(REPLACE(xmldata, 'xmlns="http://www.mytest.com/2008/FMSchema"', '')) AS "p1",
'ES_B' AS "p2"
RETURNING CONTENT ));

find the owner of EC2 instance by Athena and CloudTrail

In order to know the owner of each EC2 instance, I query the cloudtrail logs stored in S3 by Athena.
I have a table in Athena with the following stucture:
CREATE EXTERNAL TABLE cloudtrail_logs (
eventversion STRING,
useridentity STRUCT<
type:STRING,
principalid:STRING,
arn:STRING,
accountid:STRING,
invokedby:STRING,
accesskeyid:STRING,
userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
mfaauthenticated:STRING,
creationdate:STRING>,
sessionissuer:STRUCT<
type:STRING,
principalId:STRING,
arn:STRING,
accountId:STRING,
userName:STRING>>>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
ARN:STRING,
accountId:STRING,
type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
PARTITIONED BY (account string, region string, year string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<BUCKET>/AWSLogs/';
I want to find the identity of the user who launch an EC2 instances so I need to parse the field responseelements and only get the rows with responseelements that has a particular instanceID.
the field responseelements is like this:
{
"requestId":"cab34472-31cc-44cd-ae32-a84077e55cb6",
"reservationId":"r-05964c8549788ac50",
"ownerId":"xxxxxxxxxx",
"groupSet":{},
"instancesSet":{
"items":[
{"instanceId":"i-043543cb4c12",
"imageId":"ami-078df974",
"instanceState":{"code":0,"name":"pending"},
"privateDnsName":"ip-444444.eu-west-1.compute.internal",
"keyName":"key-dev","amiLaunchIndex":0,"productCodes":{},
"instanceType":"t2.large",
"launchTime":1488438050000,
"placement":{"availabilityZone":"eu-west-1b","tenancy":"default"},
"monitoring":{"state":"pending"},
"subnetId":"subnet-d8fffff",
"vpcId":"vpc-444435",
"privateIpAddress":"10.0.42.49",
"stateReason":{"code":"pending","message":"pending"},
"architecture":"x86_64",
"rootDeviceType":"ebs",
"rootDeviceName":"/dev/xvda",
"blockDeviceMapping":{},
"virtualizationType":"hvm",
"hypervisor":"xen",
"clientToken":"c6e53004-c561-437d-a642-196489ff297c_subnet-fffffffff",
"groupSet":{"items":[{"groupId":"sg-64878700","groupName":"MetamSecurityGroup"}]},
"sourceDestCheck":true,
"networkInterfaceSet":{
"items":[
{"networkInterfaceId":"eni-b16b66f0",
"subnetId":"subnet-dffffff",
"vpcId":"vpc-50fffff35",
"ownerId":"xxxxxxxx",
"status":"in-use",
"macAddress":"fdsfdsfsdfqdsf",
"privateIpAddress":"10.0.42.34234213",
"privateDnsName":"ip-1dddddd.eu-west-1.compute.internal",
"sourceDestCheck":true,
"groupSet":{"items":[{"groupId":"sg-64878700","groupName":"MetamSecurityGroup"}]},
"attachment":{"attachmentId":"eni-attach-45619121","deviceIndex":0,"status":"attaching","attachTime":1488438050000,"deleteOnTermination":true},
"privateIpAddressesSet":{"item":[{"privateIpAddress":"10ffffff","privateDnsName":"ip-ffffff.eu-west-1.compute.internal","primary":true}]},
"ipv6AddressesSet":{},
"tagSet":{}}]}
,"iamInstanceProfile":{"arn":"arn:aws:iam::xxxxx:instance-profile/infra-EC2InstanceProfile-1D59C5YR0LIYJ","id":"eeeeeeeeeeeeeeeeee"},
"ebsOptimized":false}
]
},
"requesterId":"226008221399"
}
This is my query that I tried:
SELECT DISTINCT eventsource, eventname, useridentity.userName, eventtime, json_extract(responseelements, '$.instanceId') as instance_id
FROM cloudtrail_logs
WHERE account = 'xxxxxxxxxxxxxxx'
AND eventname = 'RunInstances';
but this gives instance_id as an empty column.
How to properly get only instance_id from the resposneelement?
I found the right query to find the owner of an ECS instance. That might help someone!
SELECT DISTINCT eventsource, eventname, useridentity.userName, eventtime, json_extract(responseelements, '$.instancesSet.items[0].instanceId') as instance_id
FROM cloudtrail_logs
WHERE account = 'xxxxxxx'
AND eventname = 'RunInstances'
AND responseelements LIKE '%i-3434ecb4c12%'
;
Great answer!!! I was searching forever! Thank you! One small change. In my case account was not needed. It threw this error:
SYNTAX_ERROR: line 3:7: Column 'account' cannot be resolved
Here is how I run it:
SELECT DISTINCT eventsource,
eventname,
useridentity.userName,
eventtime,
json_extract(responseelements, '$.instancesSet.items[0].instanceId') as instance_id
FROM <myCloudLogTable>
WHERE eventname = 'RunInstances'
AND responseelements LIKE '<myinstanceId>';

Add quoted strings support to Antlr3 grammar

I'm trying to implement a grammar for parsing queries. Single query consists of items where each item can be either name or name-ref.
name is either mystring (only letters, no spaces) or "my long string" (letters and spaces, always quoted). name-ref is very similar to name and the only difference is that it should start with ref: (ref:mystring, ref:"my long string"). Query should contain at least 1 item (name or name-ref).
Here's what I have:
NAME: ('a'..'z')+;
REF_TAG: 'ref:';
SP: ' '+;
name: NAME;
name_ref: REF_TAG name;
item: name | name_ref;
query: item (SP item)*;
This grammar demonstrates what I basically need to get and the only feature is that it doesn't support long quoted strings (it works fine for names that doesn't have spaces).
SHORT_NAME: ('a'..'z')+;
LONG_NAME: SHORT_NAME (SP SHORT_NAME)*;
REF_TAG: 'ref:';
SP: ' '+;
Q: '"';
short_name: SHORT_NAME;
long_name: LONG_NAME;
name_ref: REF_TAG (short_name | (Q long_name Q));
item: (short_name | (Q long_name Q)) | name_ref;
query: item (SP item)*;
But that doesn't work. Any ideas what's the problem? Probably, that's important: my first query should be treated as 3 items (3 names) and "my first query" is 1 item (1 long_name).
ANTLR's lexer matches greedily: that is why input like my first query is being tokenized as LONG_NAME instead of 3 SHORT_NAMEs with spaces in between.
Simply remove the LONG_NAME rule and define it in the parser rule long_name.
The following grammar:
SHORT_NAME : ('a'..'z')+;
REF_TAG : 'ref:';
SP : ' '+;
Q : '"';
short_name : SHORT_NAME;
long_name : Q SHORT_NAME (SP SHORT_NAME)* Q;
name_ref : REF_TAG (short_name | (Q long_name Q));
item : short_name | long_name | name_ref;
query : item (SP item)*;
will parse the input:
my first query "my first query" ref:mystring
as follows:
However, you could also tokenize a quoted name in the lexer and strip the quotes from it with a bit of custom code. And removing spaces from the lexer could also be an option. Something like this:
SHORT_NAME : ('a'..'z')+;
LONG_NAME : '"' ~'"'* '"' {setText(getText().substring(1, getText().length()-1));};
REF_TAG : 'ref:';
SP : ' '+ {skip();};
name_ref : REF_TAG (SHORT_NAME | LONG_NAME);
item : SHORT_NAME | LONG_NAME | name_ref;
query : item+ EOF;
which would parse the same input as follows:
Note that the actual token LONG_NAME will be stripped of its start- and end-quote.
Here's a grammar that should work for your requirements:
SP: ' '+;
SHORT_NAME: ('a'..'z')+;
LONG_NAME: '"' SHORT_NAME (SP SHORT_NAME)* '"';
REF: 'ref:' (SHORT_NAME | LONG_NAME);
item: SHORT_NAME | LONG_NAME | REF;
query: item (SP item)*;
If you put this at the top:
grammar Query;
#members {
public static void main(String[] args) throws Exception {
QueryLexer lex = new QueryLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
QueryParser parser = new QueryParser(tokens);
try {
TokenSource ts = parser.getTokenStream().getTokenSource();
Token tok = ts.nextToken();
while (EOF != (tok.getType())) {
System.out.println("Got a token: " + tok);
tok = ts.nextToken();
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
You should see the lexer break everything apart nicely (I hope ;-) )
hi there "long name" ref:shortname ref:"long name"
Should give:
Got a token: [#-1,0:1='hi',<6>,1:0]
Got a token: [#-1,2:2=' ',<7>,1:2]
Got a token: [#-1,3:7='there',<6>,1:3]
Got a token: [#-1,8:8=' ',<7>,1:8]
Got a token: [#-1,9:19='"long name"',<4>,1:9]
Got a token: [#-1,20:20=' ',<7>,1:20]
Got a token: [#-1,21:33='ref:shortname',<5>,1:21]
Got a token: [#-1,34:34=' ',<7>,1:34]
Got a token: [#-1,35:49='ref:"long name"',<5>,1:35]
I'm not 100% sure what the problem is with your grammar, but I suspect the issue relates to your definition of a LONG_NAME without the quotes. Perhaps you can see what the distinction is?