Hive with regexserde property doesn't work properly


I used the regex101 website to validate my regex:
([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)" "(.*?)" "(.*?)"
It works fine for the log line below:
66.240.70.141 - - [01/Mar/2018:06:16:46 +0000] "GET /example.download.handler.com/products/01/00/item/116314/8/002394857_2BB.jpg HTTP/1.1" 200 41710 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB30P) AppleWebKit/536.37 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-" "C0T1_19610|3881001|"
But the same expression doesn't work in Hive:
CREATE EXTERNAL TABLE `web_logs_test`(
`ip_address` string COMMENT '',
`date_string` string COMMENT '',
`request` string COMMENT '',
`status` string COMMENT '',
`bytes` string COMMENT '',
`referer` string COMMENT '',
`user_agent` string COMMENT '',
`cookie` string COMMENT ''
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)" "(.*?)" "(.*?)"'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/weblogs/data'
If anyone knows, kindly help me out.
Thanks in advance.

CREATE EXTERNAL TABLE web_logs (
ip_address STRING,
date_string STRING,
request STRING,
status STRING,
bytes STRING,
referer STRING,
user_agent STRING,
cookie STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([\\d.]+) \\S+ \\S+ \\[(.+?)\\] \\\"(.+?)\\\" (\\d{3}) (\\d+) \\\"(.+?)\\\" \\\"(.+?)\\\" \\\"SESSIONID=(\\d+)\\\"\\s*"
)
LOCATION '/file_location/web_logs';
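
Two things are worth checking against the question's DDL. As far as I know, RegexSerDe expects one capturing group per column and matches the whole line, and the question's pattern has nine groups for eight columns. Backslashes also have to be doubled inside a HiveQL string literal, which the DDL above does and the question's DDL does not. The sketch below checks both offline. It uses Python's re module as a stand-in for the Java regex engine that Hive uses (an assumption that the two agree for these simple patterns). EIGHT_GROUP_REGEX, parse_line and hive_literal are hypothetical names introduced for illustration, not part of Hive.

```python
import re

# Column names copied from the question's CREATE TABLE statement.
COLUMNS = ["ip_address", "date_string", "request", "status",
           "bytes", "referer", "user_agent", "cookie"]

# The question's pattern, as the regex engine should see it.
QUESTION_REGEX = (r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) '
                  r'"(.*?)" "(.*?)" "(.*?)" "(.*?)"')

# Hypothetical variant: the third quoted field is matched but not captured,
# so eight groups line up with the eight columns.
EIGHT_GROUP_REGEX = (r'([\d.]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) '
                     r'"(.*?)" "(.*?)" ".*?" "(.*?)"')

# The sample log line from the question.
SAMPLE_LINE = (
    '66.240.70.141 - - [01/Mar/2018:06:16:46 +0000] '
    '"GET /example.download.handler.com/products/01/00/item/116314/8/'
    '002394857_2BB.jpg HTTP/1.1" 200 41710 "-" '
    '"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB30P) '
    'AppleWebKit/536.37 (KHTML, like Gecko) Chrome/41.0.2272.96 '
    'Mobile Safari/537.36 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)" "-" "C0T1_19610|3881001|"'
)


def parse_line(regex, line, columns):
    """Map columns to captured groups; reject a group/column count mismatch."""
    pattern = re.compile(regex)
    if pattern.groups != len(columns):
        raise ValueError(f"{pattern.groups} groups for {len(columns)} columns")
    match = pattern.fullmatch(line)  # the whole line has to match
    return dict(zip(columns, match.groups())) if match else None


def hive_literal(regex):
    """Double every backslash so the regex survives a HiveQL string literal."""
    return regex.replace("\\", "\\\\")


if __name__ == "__main__":
    print(re.compile(QUESTION_REGEX).groups)   # number of groups in the question's regex
    print(parse_line(EIGHT_GROUP_REGEX, SAMPLE_LINE, COLUMNS))
    print(hive_literal(EIGHT_GROUP_REGEX))     # text to paste into 'input.regex'
```

Whether to drop the extra quoted field or add a ninth column to the table is a design choice; the sketch only shows the first option.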

Related

Create JSON from XML - JSON_AGG OUTPUT PROBLEM

I have a problem converting XML content to JSON (with a plain Oracle SELECT statement) when the original XML has more than one sub-level of data: with my code, the result for level 2 and deeper is rendered as a string and not as a JSON object. Could someone please tell me where the fault in my code is, or what I'm doing wrong?
source:
<envelope>
<sender>
<name>IZS</name>
<country>SU</country>
<address>LOCATION 10B</address>
<address>1000 CITY</address>
<sender_identifier>SU46794093</sender_identifier>
<sender_address>
<sender_agent>SKWWSI20XXX</sender_agent>
<sender_mailbox>SI56031098765414228</sender_mailbox>
</sender_address>
</sender>
</envelope>
transformation select statement:
WITH SAMPLE AS (SELECT XMLTYPE ('
<envelope>
<sender>
<name>IZS</name>
<country>SU</country>
<address>LOCATION 10B</address>
<address>1000 CITY</address>
<sender_identifier>SU46794093</sender_identifier>
<sender_address>
<sender_agent>SKWWSI20XXX</sender_agent>
<sender_mailbox>SI56031098765414228</sender_mailbox>
</sender_address>
</sender>
</envelope>') XMLDOC FROM DUAL)
SELECT JSON_SERIALIZE (
         JSON_OBJECT (
           KEY 'envelope' VALUE
             JSON_OBJECTAGG (
               KEY ID_LEVEL1 VALUE
                 CASE ID_LEVEL1
                   WHEN 'sender' THEN
                     ( SELECT JSON_OBJECTAGG (
                                KEY ID_LEVEL2 VALUE
                                  CASE ID_LEVEL2
                                    WHEN 'sender_address' THEN
                                      ( SELECT JSON_OBJECTAGG (KEY ID_LEVEL22 VALUE TEXT_LEVEL22)
                                          FROM XMLTABLE ('/sender/sender_address/*'
                                                 PASSING XML_LEVEL2
                                                 COLUMNS ID_LEVEL22 VARCHAR2 (128) PATH './name()',
                                                         TEXT_LEVEL22 VARCHAR2 (128) PATH './text()'
                                               )
                                      )
                                    ELSE
                                      TEXT_LEVEL2
                                  END)
                         FROM XMLTABLE ('/sender/*'
                                PASSING XML_LEVEL2
                                COLUMNS ID_LEVEL2 VARCHAR2 (1024) PATH './name()',
                                        TEXT_LEVEL2 VARCHAR2 (1024) PATH './text()'
                              )
                     )
                   ELSE
                     '"' || TEXT_LEVEL1 || '"'
                 END FORMAT JSON)
         ) PRETTY
       ) JSON_DOC
  FROM SAMPLE, XMLTABLE ('/envelope/*'
                 PASSING XMLDOC
                 COLUMNS ID_LEVEL1 VARCHAR2 (1024) PATH './name()',
                         TEXT_LEVEL1 VARCHAR2 (1024) PATH './text()',
                         XML_LEVEL2 XMLTYPE PATH '.'
               );
wrong result:
{
  "envelope" :
  {
    "sender" :
    {
      "name" : "IZS",
      "country" : "SU",
      "address" : "LOCATION 10B",
      "address" : "1000 CITY",
      "sender_identifier" : "SU46794093",
      "sender_address" : "{\"sender_agent\":\"SKWWSI20XXX\",\"sender_mailbox\":\"SI56031098765414228\"}"
    }
  }
}
wrong part:
"sender_address" : "{\"sender_agent\":\"SKWWSI20XXX\",\"sender_mailbox\":\"SI56031098765414228\"}"

For the level 1 text you're wrapping the value in double-quotes and specifying format json; you aren't doing that for level 2. If you change:
ELSE
TEXT_LEVEL2
END
to:
ELSE
'"' || TEXT_LEVEL2 || '"'
END FORMAT JSON)
then the result is:
{
  "envelope" :
  {
    "sender" :
    {
      "name" : "IZS",
      "country" : "SU",
      "address" : "LOCATION 10B",
      "address" : "1000 CITY",
      "sender_identifier" : "SU46794093",
      "sender_address" :
      {
        "sender_agent" : "SKWWSI20XXX",
        "sender_mailbox" : "SI56031098765414228"
      }
    }
  }
}

The problem is that you need a kind of conditional FORMAT JSON inside "SELECT JSON_OBJECTAGG ( KEY ID_LEVEL2 VALUE CASE ID_LEVEL2": it should apply when ID_LEVEL2 is 'sender_address' but not to the ELSE branch. The syntax, however, only lets you put FORMAT JSON after the END of the CASE, so it applies to every branch, and that fails for the plain "ELSE TEXT_LEVEL2" value.
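
The effect is not specific to Oracle: text that is already serialized JSON gets escaped into a string unless it is explicitly treated as JSON. A small Python analogy, in which json.loads plays the role of FORMAT JSON (the variable names are illustrative only):

```python
import json

# Inner object already serialized to text, like the result of the
# innermost JSON_OBJECTAGG subquery.
inner_text = json.dumps({"sender_agent": "SKWWSI20XXX",
                         "sender_mailbox": "SI56031098765414228"})

# Without the FORMAT JSON step: the text is embedded as a plain string,
# so its quotes are escaped.
as_string = json.dumps({"sender_address": inner_text})

# With the FORMAT JSON step: the text is parsed first and embedded
# as a nested object.
as_object = json.dumps({"sender_address": json.loads(inner_text)})

if __name__ == "__main__":
    print(as_string)
    print(as_object)
```

This also shows why the plain text values need their own double quotes once FORMAT JSON applies to the whole CASE: an unquoted IZS is not valid JSON, while "IZS" is.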

How can I insert data into this structure in BigQuery?

In BigQuery, I want to insert some data into this very simple data structure:
Field              Type    Mode
id                 STRING  NULLABLE
policies           RECORD  REPEATED
  s                RECORD  NULLABLE
    something      STRING  NULLABLE
    riskTypes      RECORD  REPEATED
      code         STRING  NULLABLE
In the light of my previous question I would expect the syntax to be as follows:
UPDATE `tablename` SET policies = ARRAY_CONCAT(
policies, [
struct<s struct<something STRING, riskTypes ARRAY<struct<code STRING>>>
("example something", [("example description")])
]
)
WHERE id = 'Moose';
But this gives an error:
Unexpected "[" (before the "example description")

Below should work
update `tablename` set policies = policies || [
struct<s struct<something string, riskTypes array<struct<code string>> >>
(struct('example something' as something, [struct('example description 1' as code), struct('example description 2')]))
]
where id = 'Moose'

How to update XML attribute in clob Oracle using XMLQuery

Oracle table name: SR_DATA;
Table field name: XMLDATA type CLOB;
Field value:
<module xmlns="http://www.mytest.com/2008/FMSchema">
<tmEsObjective modelCodeScheme="A" modelCodeSchemeVersion="01" modelCodeValue="ES_A"></tmEsObjective>
</module>
I need to change the value of the attribute "modelCodeValue" to ES_B.
This is the code:
UPDATE SR_DATA
SET XMLDATA =
XMLQuery('copy $i := $p1 modify
((for $j in $i/module/tmEsObjective/@modelCodeValue
return replace value of node $j with $p2))
)
return $i'
PASSING XMLType(REPLACE(xmldata, 'xmlns="http://www.mytest.com/2008/FMSchema"', '')) AS "p1",
'ES_B' AS "p2"
RETURNING CONTENT);
This code returns the error code: ORA-00932: inconsistent datatypes: expected CLOB got -

XMLQuery returns an XMLType, but the column is a CLOB, so convert the result with getclobval() like this:
UPDATE SR_DATA
SET XMLDATA =
XMLTYPE.GETCLOBVAL(XMLQuery('copy $i := $p1 modify
((for $j in $i/module/tmEsObjective/@modelCodeValue
return replace value of node $j with $p2))
return $i'
PASSING XMLType(REPLACE(xmldata, 'xmlns="http://www.mytest.com/2008/FMSchema"', '')) AS "p1",
'ES_B' AS "p2"
RETURNING CONTENT ));
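
The statement follows a text-to-tree-to-text round trip: the CLOB is parsed into an XML value, the attribute is changed, and the result is serialized back to text. A rough Python analogy of the same round trip, using xml.etree.ElementTree. The function name set_model_code_value is hypothetical, and unlike the SQL above this sketch keeps the namespace declaration instead of stripping it:

```python
import xml.etree.ElementTree as ET

NS = "http://www.mytest.com/2008/FMSchema"


def set_model_code_value(xml_text, new_value):
    """Parse the text, update the attribute, and serialize back to text."""
    # Keep xmlns="..." as the default namespace when serializing.
    ET.register_namespace("", NS)
    root = ET.fromstring(xml_text)
    # Elements live in the namespace, so the tag is qualified with it.
    for node in root.iter(f"{{{NS}}}tmEsObjective"):
        node.set("modelCodeValue", new_value)
    # encoding="unicode" returns a str, the counterpart of getclobval().
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    source = (
        '<module xmlns="http://www.mytest.com/2008/FMSchema">'
        '<tmEsObjective modelCodeScheme="A" modelCodeSchemeVersion="01" '
        'modelCodeValue="ES_A"></tmEsObjective></module>'
    )
    print(set_model_code_value(source, "ES_B"))
```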

Find the owner of an EC2 instance with Athena and CloudTrail

To find out who owns each EC2 instance, I query the CloudTrail logs stored in S3 using Athena.
I have a table in Athena with the following structure:
CREATE EXTERNAL TABLE cloudtrail_logs (
eventversion STRING,
useridentity STRUCT<
type:STRING,
principalid:STRING,
arn:STRING,
accountid:STRING,
invokedby:STRING,
accesskeyid:STRING,
userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
mfaauthenticated:STRING,
creationdate:STRING>,
sessionissuer:STRUCT<
type:STRING,
principalId:STRING,
arn:STRING,
accountId:STRING,
userName:STRING>>>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
ARN:STRING,
accountId:STRING,
type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
PARTITIONED BY (account string, region string, year string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<BUCKET>/AWSLogs/';
I want to find the identity of the user who launched an EC2 instance, so I need to parse the responseelements field and keep only the rows whose responseelements contain a particular instance ID.
The responseelements field looks like this:
{
"requestId":"cab34472-31cc-44cd-ae32-a84077e55cb6",
"reservationId":"r-05964c8549788ac50",
"ownerId":"xxxxxxxxxx",
"groupSet":{},
"instancesSet":{
"items":[
{"instanceId":"i-043543cb4c12",
"imageId":"ami-078df974",
"instanceState":{"code":0,"name":"pending"},
"privateDnsName":"ip-444444.eu-west-1.compute.internal",
"keyName":"key-dev","amiLaunchIndex":0,"productCodes":{},
"instanceType":"t2.large",
"launchTime":1488438050000,
"placement":{"availabilityZone":"eu-west-1b","tenancy":"default"},
"monitoring":{"state":"pending"},
"subnetId":"subnet-d8fffff",
"vpcId":"vpc-444435",
"privateIpAddress":"10.0.42.49",
"stateReason":{"code":"pending","message":"pending"},
"architecture":"x86_64",
"rootDeviceType":"ebs",
"rootDeviceName":"/dev/xvda",
"blockDeviceMapping":{},
"virtualizationType":"hvm",
"hypervisor":"xen",
"clientToken":"c6e53004-c561-437d-a642-196489ff297c_subnet-fffffffff",
"groupSet":{"items":[{"groupId":"sg-64878700","groupName":"MetamSecurityGroup"}]},
"sourceDestCheck":true,
"networkInterfaceSet":{
"items":[
{"networkInterfaceId":"eni-b16b66f0",
"subnetId":"subnet-dffffff",
"vpcId":"vpc-50fffff35",
"ownerId":"xxxxxxxx",
"status":"in-use",
"macAddress":"fdsfdsfsdfqdsf",
"privateIpAddress":"10.0.42.34234213",
"privateDnsName":"ip-1dddddd.eu-west-1.compute.internal",
"sourceDestCheck":true,
"groupSet":{"items":[{"groupId":"sg-64878700","groupName":"MetamSecurityGroup"}]},
"attachment":{"attachmentId":"eni-attach-45619121","deviceIndex":0,"status":"attaching","attachTime":1488438050000,"deleteOnTermination":true},
"privateIpAddressesSet":{"item":[{"privateIpAddress":"10ffffff","privateDnsName":"ip-ffffff.eu-west-1.compute.internal","primary":true}]},
"ipv6AddressesSet":{},
"tagSet":{}}]}
,"iamInstanceProfile":{"arn":"arn:aws:iam::xxxxx:instance-profile/infra-EC2InstanceProfile-1D59C5YR0LIYJ","id":"eeeeeeeeeeeeeeeeee"},
"ebsOptimized":false}
]
},
"requesterId":"226008221399"
}
This is the query I tried:
SELECT DISTINCT eventsource, eventname, useridentity.userName, eventtime, json_extract(responseelements, '$.instanceId') as instance_id
FROM cloudtrail_logs
WHERE account = 'xxxxxxxxxxxxxxx'
AND eventname = 'RunInstances';
but this returns an empty instance_id column.
How do I properly extract only the instance ID from responseelements?

I found the right query to find the owner of an EC2 instance. It might help someone!
SELECT DISTINCT eventsource, eventname, useridentity.userName, eventtime, json_extract(responseelements, '$.instancesSet.items[0].instanceId') as instance_id
FROM cloudtrail_logs
WHERE account = 'xxxxxxx'
AND eventname = 'RunInstances'
AND responseelements LIKE '%i-3434ecb4c12%'
;
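
The reason the first attempt returned nothing is that instanceId is not a top-level key: it sits inside the first element of instancesSet.items. A minimal Python sketch of the same lookup on a trimmed-down copy of the responseelements value from the question (first_instance_id is a hypothetical helper, not an Athena function):

```python
import json

# Trimmed-down responseelements value with the same nesting as in the question.
response_elements = json.loads("""
{
  "requestId": "cab34472-31cc-44cd-ae32-a84077e55cb6",
  "instancesSet": {
    "items": [
      {"instanceId": "i-043543cb4c12", "instanceType": "t2.large"}
    ]
  }
}
""")


def first_instance_id(doc):
    """Follow $.instancesSet.items[0].instanceId; return None if absent."""
    items = doc.get("instancesSet", {}).get("items", [])
    return items[0].get("instanceId") if items else None


if __name__ == "__main__":
    # Top-level lookup, the equivalent of '$.instanceId': nothing there.
    print(response_elements.get("instanceId"))
    # Nested lookup, the equivalent of '$.instancesSet.items[0].instanceId'.
    print(first_instance_id(response_elements))
```

Note that items[0] only covers the first instance of a RunInstances call; a call that launched several instances has more entries in items.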

Great answer!!! I was searching forever! Thank you! One small change. In my case account was not needed. It threw this error:
SYNTAX_ERROR: line 3:7: Column 'account' cannot be resolved
Here is how I run it:
SELECT DISTINCT eventsource,
eventname,
useridentity.userName,
eventtime,
json_extract(responseelements, '$.instancesSet.items[0].instanceId') as instance_id
FROM <myCloudLogTable>
WHERE eventname = 'RunInstances'
AND responseelements LIKE '%<myinstanceId>%';

Add quoted strings support to Antlr3 grammar

I'm trying to implement a grammar for parsing queries. Single query consists of items where each item can be either name or name-ref.
name is either mystring (only letters, no spaces) or "my long string" (letters and spaces, always quoted). name-ref is very similar to name and the only difference is that it should start with ref: (ref:mystring, ref:"my long string"). Query should contain at least 1 item (name or name-ref).
Here's what I have:
NAME: ('a'..'z')+;
REF_TAG: 'ref:';
SP: ' '+;
name: NAME;
name_ref: REF_TAG name;
item: name | name_ref;
query: item (SP item)*;
This grammar demonstrates what I basically need; its only limitation is that it doesn't support long quoted strings (it works fine for names that don't contain spaces). To add them, I tried this:
SHORT_NAME: ('a'..'z')+;
LONG_NAME: SHORT_NAME (SP SHORT_NAME)*;
REF_TAG: 'ref:';
SP: ' '+;
Q: '"';
short_name: SHORT_NAME;
long_name: LONG_NAME;
name_ref: REF_TAG (short_name | (Q long_name Q));
item: (short_name | (Q long_name Q)) | name_ref;
query: item (SP item)*;
But that doesn't work. Any ideas what the problem is? This is probably important: my first query should be treated as 3 items (3 names), and "my first query" as 1 item (1 long_name).

ANTLR's lexer matches greedily: that is why input like my first query is being tokenized as LONG_NAME instead of 3 SHORT_NAMEs with spaces in between.
Simply remove the LONG_NAME rule and define it in the parser rule long_name.
The following grammar:
SHORT_NAME : ('a'..'z')+;
REF_TAG : 'ref:';
SP : ' '+;
Q : '"';
short_name : SHORT_NAME;
long_name : Q SHORT_NAME (SP SHORT_NAME)* Q;
name_ref : REF_TAG (short_name | long_name);
item : short_name | long_name | name_ref;
query : item (SP item)*;
will parse the input:
my first query "my first query" ref:mystring
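as five items: three short_name items (my, first, query), one long_name item ("my first query") and one name_ref item (ref:mystring).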
However, you could also tokenize a quoted name in the lexer and strip the quotes from it with a bit of custom code. And removing spaces from the lexer could also be an option. Something like this:
SHORT_NAME : ('a'..'z')+;
LONG_NAME : '"' ~'"'* '"' {setText(getText().substring(1, getText().length()-1));};
REF_TAG : 'ref:';
SP : ' '+ {skip();};
name_ref : REF_TAG (SHORT_NAME | LONG_NAME);
item : SHORT_NAME | LONG_NAME | name_ref;
query : item+ EOF;
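which would parse the same input into the same five items, with the SHORT_NAME and LONG_NAME tokens used directly in place of the short_name and long_name rules.
Note that the text of the LONG_NAME token will have its opening and closing quotes stripped.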
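
The greedy behaviour described at the start of this answer can be imitated outside ANTLR. The sketch below is a simplified longest-match tokenizer in Python, not how ANTLR 3 is implemented (its lexer predicts with lookahead instead of trying every rule), and the rule tables are hypothetical regex translations of the lexer rules in the question:

```python
import re


def tokenize(text, rules):
    """At each position, pick the rule with the longest match; ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


# Lexer rules of the question's second grammar, including LONG_NAME.
WITH_LONG_NAME = [
    ("SHORT_NAME", r"[a-z]+"),
    ("LONG_NAME", r"[a-z]+( +[a-z]+)*"),
    ("SP", r" +"),
]

# The same rules with LONG_NAME removed from the lexer.
WITHOUT_LONG_NAME = [
    ("SHORT_NAME", r"[a-z]+"),
    ("SP", r" +"),
]

if __name__ == "__main__":
    print(tokenize("my first query", WITH_LONG_NAME))
    print(tokenize("my first query", WITHOUT_LONG_NAME))
```

With LONG_NAME in the lexer, the whole input becomes a single token, so the parser never sees three separate names; without it, the words and spaces come out as separate tokens.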

Here's a grammar that should work for your requirements:
SP: ' '+;
SHORT_NAME: ('a'..'z')+;
LONG_NAME: '"' SHORT_NAME (SP SHORT_NAME)* '"';
REF: 'ref:' (SHORT_NAME | LONG_NAME);
item: SHORT_NAME | LONG_NAME | REF;
query: item (SP item)*;
If you put this at the top:
grammar Query;
@members {
    // Prints every token the lexer produces for the file named in args[0].
    public static void main(String[] args) throws Exception {
        QueryLexer lex = new QueryLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lex);
        QueryParser parser = new QueryParser(tokens);
        try {
            // Pull tokens straight from the lexer until EOF is reached.
            TokenSource ts = parser.getTokenStream().getTokenSource();
            Token tok = ts.nextToken();
            while (EOF != (tok.getType())) {
                System.out.println("Got a token: " + tok);
                tok = ts.nextToken();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You should see the lexer break everything apart nicely (I hope ;-) )
hi there "long name" ref:shortname ref:"long name"
Should give:
Got a token: [@-1,0:1='hi',<6>,1:0]
Got a token: [@-1,2:2=' ',<7>,1:2]
Got a token: [@-1,3:7='there',<6>,1:3]
Got a token: [@-1,8:8=' ',<7>,1:8]
Got a token: [@-1,9:19='"long name"',<4>,1:9]
Got a token: [@-1,20:20=' ',<7>,1:20]
Got a token: [@-1,21:33='ref:shortname',<5>,1:21]
Got a token: [@-1,34:34=' ',<7>,1:34]
Got a token: [@-1,35:49='ref:"long name"',<5>,1:35]
I'm not 100% sure what the problem is with your grammar, but I suspect the issue relates to your definition of a LONG_NAME without the quotes. Perhaps you can see what the distinction is?
