Fetch query params from S3 access log using Athena - hive

I wish to fetch a map of query params from S3 access log using Athena.
E.g. for the following log line example:
283e.. foo [17/Jun/2017:23:00:49 +0000] 76.117.221.205 - 1D0.. REST.GET.OBJECT 1x1.gif "GET /foo.bar/1x1.gif?placement_tag_id=0&r=574&placement_hash=12345... HTTP/1.1" 200 ... "Mozilla/5.0"
I want to get a map queryParams of [k, v]:
placement_tag_id,0
r,574
placement_hash,12345
So I'll be able to run queries such as:
select * from accessLogs where queryParams.placement_tag_id=0 and queryParams.r>=500
The query params count and content differ from one request to another so I can't use a static RegEx pattern.
I used serde2.RegexSerDe on the following Athena create table query to make a basic split of the log, but didn't find a method to achieve what I want.
I thought of using MultiDelimitSerDe but it's not supported in Athena.
Any suggestion on how to achieve that?
CREATE EXTERNAL TABLE IF NOT EXISTS elb_db.accessLogs (
timestamp string,
request string,
http_status string,
user_agent string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '[^ ]* [^ ]* \\[(.*)\\] [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* "(.*?)" ([^ ]*) [^ ]* [^ ]* [^ ]* [^ ]* [^ ]* ".*?" "(.*?)" [^ ]*'
) LOCATION 's3://output/bucket'

Related

SQL - REGEX Match the whole word if it has an # symbol in it

So I am using Snowflake and specifically the REGEXP_REPLACE function. I am looking for a Regex expression that will match any word with an # symbol in it in a text field.
Example:
RAW_DATA | CLEANED_DATA
here is a sample and then an email#gmail.com | here is a sample and then an xxxxx
abc#test.com | xxxxx
What I have tried so far is:
Select regexp_replace('ABC#gmail.com' , '(([a-zA-Z]+)(\W)?([a-zA-Z]+))', 'xxxxxxx') as result;
Result:
xxxxxxx#xxxxxxx.xxxxxxx
You can use
Select regexp_replace('here is a sample and then an email#gmail.com' , '\\S+#\\S+', 'xxxxx') as result;
Here,
\S+ - one or more non-whitespace chars
# - a # char
\S+ - one or more non-whitespace chars
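As a quick check of the same pattern outside Snowflake, here is a minimal Python sketch. The function name mask_hash_words is hypothetical, and Python's raw strings need only a single backslash where the Snowflake string literal needs two.

```python
import re

def mask_hash_words(text: str, mask: str = "xxxxx") -> str:
    # \S+#\S+ : a run of non-whitespace, a literal '#', another run of non-whitespace
    return re.sub(r"\S+#\S+", mask, text)

print(mask_hash_words("here is a sample and then an email#gmail.com"))
print(mask_hash_words("abc#test.com"))
```

Text without a # between non-whitespace characters is returned unchanged.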

Remove square brackets if content within the square bracket does not contain spaces

I have a requirement to convert SQL from one format to another.
Here is a sample:
select [Project_ID] AS [Project ID]
Convert the above line to:
select Project_ID AS "Project ID"
So I am thinking of a two-step strategy:
1: Remove the [] from the ones that do not contain spaces, maybe via regex.
2: Replace [ and ] with " for the rest.
I have more than 10K lines of code that need to be changed, so doing this manually would take a lot of time.
You can do both replacements in a single step:
Ctrl+H
Find what: \[(\w+)\]|\[([^\]]+)\]
Replace with: (?1$1:(?2"$2"))
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
\[ # opening square bracket
(\w+) # group 1, 1 or more word character
\] # closing square bracket
| # OR
\[ # opening square bracket
([^\]]+) # group 2, 1 or more any character that is not a closing bracket
\] # closing square bracket
Replacement:
(?1 # if group 1 exists:
$1 # replace with content of group 1
: # else
(?2 # if group 2 exists:
"$2" # replace with content of group 2 surrounded with quotes
) # endif group 2
) # endif group 1
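The conditional replacement syntax above is specific to the editor's regex engine. As a rough equivalent for scripting the change, here is a minimal Python sketch that uses a replacement function to branch on which group matched. The names BRACKETED and convert_brackets are hypothetical.

```python
import re

# Group 1: bracketed word characters only; group 2: any other bracketed text
BRACKETED = re.compile(r"\[(\w+)\]|\[([^\]]+)\]")

def convert_brackets(sql: str) -> str:
    def repl(m: re.Match) -> str:
        if m.group(1) is not None:
            # No spaces or punctuation inside: drop the brackets
            return m.group(1)
        # Otherwise wrap the content in double quotes
        return f'"{m.group(2)}"'
    return BRACKETED.sub(repl, sql)

print(convert_brackets("select [Project_ID] AS [Project ID]"))
```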
You can use two regular expressions to get the desired format.
Follow these steps:
Back up your main file first, and do this on a temporary copy.
Press Ctrl+H and select "Regular expression" as the search mode.
In the "Find what" box, enter: (?i)(?<=select)\s+\[|\]\s+(?=as)
Then click Replace All.
In the "Find what" box, enter: \[([^\[\]]+)\]
In the "Replace with" box, enter "\1" and click Replace All.
This is a simple task for two regular expression replaces.
First replace \[([^\[\] \r\n]+)\] with $1. This removes the brackets from non-space strings.
Next replace \[([^\[\] \r\n]+ [^\[\]\r\n]+)\] with "$1". This replaces brackets with double quotes on strings with spaces.
Note both replacements include \[\] and \r\n. This restricts the replacements to strings that do not themselves contain brackets and do not contain newlines.
Please make a backup of your file before doing this sort of edit. Because the above does two separate replaces it is possible that lines that have nested brackets will be treated wrongly.
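For reference, the same two passes can be tried in Python before touching the real file. This is only a sketch: the function name convert_two_pass is hypothetical, and Python writes the backreference as \1 rather than $1.

```python
import re

def convert_two_pass(sql: str) -> str:
    # Pass 1: brackets around text with no spaces are simply removed
    sql = re.sub(r"\[([^\[\] \r\n]+)\]", r"\1", sql)
    # Pass 2: brackets around text containing a space become double quotes
    sql = re.sub(r"\[([^\[\] \r\n]+ [^\[\]\r\n]+)\]", r'"\1"', sql)
    return sql

print(convert_two_pass("select [Project_ID] AS [Project ID]"))
```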
If all of the lines of code are in the same format as select [Project_ID] AS [Project ID] then you can do this in the following order:
1: Replace ] as [ with AS " -- (Space AS ")
2: Replace [ with -- (Space)
3: Replace ] with "
EDIT:
If the data is in a mixed format:
select [Project_ID] AS [Project ID] and perhaps
select [Project_ID] [Project ID]
then you can do this in the following order:
1: Replace ] as [ with AS " -- (Space AS ")
2: Replace ] [ with " -- (Space ")
3: Replace [ with -- (Space)
4: Replace ] with "
Not the most efficient way. But OP is fine with awk
awk -F'[' '{gsub(/\]/ , "", $2); gsub(/\]/ ,"" ,$3); $3="\042"$3"\042"; print}' ;
Demo :
$ echo "select [Project_ID] AS [Project ID]" | awk -F'[' '{gsub(/\]/ , "", $2); gsub(/\]/ ,"" ,$3); $3="\042"$3"\042"; print}' ;
select Project_ID AS "Project ID"
$
Explanation :
awk -F'[' -- Set delimiter as "["
gsub(/\]/ , "", $2); -- Replace ] with "" in the second field. The regular expression goes between / /, and \ is the escape character.
gsub(/\]/ ,"" ,$3) -- Remove ] from third field.
$3="\042"$3"\042" -- Concat double quotes at start and end of 3rd field.

How to cut off strings before 4th occurrence of a character in postgresql?

I know in a lot of databases the charindex function can accept getting the character on the third occurrence but the strpos in postgresql doesn't accept it.
Basically I'd need to cut everything after the 4th space (including the last space)
If I have a string like:
FLAT 11, ELMER HOUSE 33-35
How to cut it after the 'HOUSE' to turn it into just:
FLAT 11, ELMER HOUSE
And no, using left won't work because these strings are very variable.
Here is one method:
select substring(str || ' ' from '^[^ ]* [^ ]* [^ ]* [^ ]*')
This looks for four groups of characters (possibly empty) separated by single spaces, so it returns everything up to, but not including, the fourth space in str.
The || ' ' appends one space so that a string with only two spaces (three words) still matches, with an empty fourth group. With fewer spaces than that, the pattern does not match and substring returns NULL.
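To experiment with the idea outside the database, here is a small Python sketch that keeps the first four space-separated groups of a string. The names FIRST_FOUR and first_four_words are hypothetical, and Python's re module stands in for PostgreSQL's regex engine, so edge cases may differ.

```python
import re

# Four runs of non-space characters (possibly empty) separated by single spaces
FIRST_FOUR = re.compile(r"^[^ ]* [^ ]* [^ ]* [^ ]*")

def first_four_words(s: str):
    # Append one space, mirroring str || ' ' in the SQL version
    m = FIRST_FOUR.match(s + " ")
    return m.group(0) if m else None

print(first_four_words("FLAT 11, ELMER HOUSE 33-35"))
```

Strings with too few spaces give None here, which corresponds to the NULL case in SQL.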

Multiple Patterns in Regex

Can there be multiple patterns in a single Regexp_Replace call?
Pattern 1 : '^#.*'
Pattern 2: '^//.*'
Pattern 3 : '^&&.*'
I want all three patterns in same regexp_replace function like
select REGEXP_REPLACE ('Unit testing last level','Pattern 1,Pattern 2,Pattern 3','',1,0,'m')
from dual;
You can use an alternation group where all alternative branches are |-separated.
^(#|//|&&).*
The (...) form a grouping construct where you may place your various #, &&, and other possible "branches". A | is an alternation operator.
The pattern will match:
^ - start of a line (as you are passing m match_parameter)
(#|//|&&) - either #, // or &&
.* - any 0+ chars other than a newline (since n match_parameter is not used).
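The alternation can be checked quickly in Python, where re.MULTILINE plays roughly the role of the m match_parameter. This is a sketch only: the names MARKED_LINE and strip_marked_lines are hypothetical, and Oracle's regex flavor may differ in details.

```python
import re

# ^ anchors at each line start because of re.MULTILINE;
# '.' stops at the newline by default
MARKED_LINE = re.compile(r"^(#|//|&&).*", re.MULTILINE)

def strip_marked_lines(text: str) -> str:
    # Blank out the content of lines that start with #, // or &&
    return MARKED_LINE.sub("", text)

sample = "# comment\nkeep me\n// note\n&& other\nlast"
print(repr(strip_marked_lines(sample)))
```

The newlines themselves are kept, so the matching lines become empty rather than disappearing.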

regex - lazy and non-capturing

String to search:
VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)
Regex search so far:
VALUES\s*\(\s*'?\s*(.+?)\s*'?\s*,\s*'?\s*(.+?)\s*'?\s*,\s*'?\s*(.+?)\s*'?\s*\)
Regex replace to 3 groups: ie \1 \2 \3
I am aiming for a result of:
9gfdg to_date('1876/12/06' ,'YYYY/MM/DD') null
but instead get (because of that extra comma in to_Date and also lazy instead of greedy):
9gfdg to_date('1876/12/06 YYYY/MM/DD , null)
Note:
There are exactly 3 fields (the values within the 3 fields may differ, but you get the idea of the format I am grappling with), i.e. each field could contain commas (usually character values), or be a keyword such as null, a number, or a to_date expression.
Regex engine is VBA/VBscript
Anyone have any pointers on fixing up this regex?
Here is a solution.
Notice the regex for $field: it is yet another application of the normal* (special normal*)* pattern, with normal being anything but a comma ([^,]) and special a comma as long as it is not followed by two single quotes (,(?!'')). The first normal, however, is made non empty using + instead of *.
Demonstration code in perl. The string concatenation operator in perl is a dot:
fge#erwin $ cat t.pl
#!/usr/bin/perl -W
use strict;
# Value separator: a comma optionally surrounded by spaces
my $value_separator = '\s*,\s*';
# Literal "null", and a number
my $null = 'null';
my $number = '\d+';
# Text field
my $normal = '[^,]'; # Anything but a comma
my $special = ",(?!'')"; # A comma, _not_ followed by two single quotes
my $field = "'$normal+(?:$special$normal*)*'"; # a text field
# A to_date() expression
my $to_date = 'to_date\(\s*' . $field . $value_separator . $field . '\s*\)';
# Any field
my $any_field = '(' . $null . '|' . $number . '|' . $field . '|' . $to_date . ')';
# The full regex
my $full_regex = '^\s*VALUES\s*\(\s*' . $any_field . $value_separator . $any_field
. $value_separator . $any_field . '\s*\)\s*$';
# This builds a compiled form of the regex
my $re = qr/$full_regex/;
# Read from stdin, try and match (m//), if match, print the three captured groups
while (<STDIN>) {
m/$re/ and print <<EOF;
Argument 1: -->$1<--
Argument 2: -->$2<--
Argument 3: -->$3<--
EOF
}
Demonstration output:
fge#erwin ~ $ perl t.pl
VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)
Argument 1: -->'9gfdg'<--
Argument 2: -->to_date('1876/12/06','YYYY/MM/DD')<--
Argument 3: -->null<--
VALUES('prout', 'ma', 'chere')
Argument 1: -->'prout'<--
Argument 2: -->'ma'<--
Argument 3: -->'chere'<--
VALUES(324, 'Aiie, a comma', to_date('whatever', 'is there, even commas'))
Argument 1: -->324<--
Argument 2: -->'Aiie, a comma'<--
Argument 3: -->to_date('whatever', 'is there, even commas')<--
One thing to note: you will notice that I don't ever use any lazy quantifiers, and not even the dot!
edit: special in a field is actually a comma not followed by two single quotes, not one
If only the second parameter can have commas in it, you could do something like:
^VALUES\s*\(\s*'?([^',]*)'?\s*,\s*(.*?)\s*,\s*'?([^',]*)'?\s*\)$
Otherwise I don't know what features that regex flavor supports, so it is hard to make something more fun. Although you could always make a limited-depth nested parentheses regex if (?R) is not supported.
For the more general case you could try something like:
^\s*
VALUES\s*
\(
\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*,\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*,\s*
(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )
\s*
\)\s*
$
Spaces removed:
^\s*VALUES\s*\(\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*,\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*,\s*(?:'([^']*)'|(\w+(?:\([^()]*\))?))\s*\)\s*$
Replace with:
\1\2 \3\4 \5\6
Should work for one nested level of parentheses without any quoted parenthesis in them.
PS: Not tested. You can usually use the spaced regex if your flavor supports the /x flag.
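As an illustration of the spaced form, here is a minimal Python sketch using re.VERBOSE, which is Python's counterpart of the /x flag. It is not the VBA/VBScript engine from the question, and the names FIELD, VALUES_RE and split_values are hypothetical. The field sub-pattern is written once and reused three times.

```python
import re

# One field: a quoted string, or a word optionally followed by (...) with no nested parentheses
FIELD = r"(?: '([^']*)' | ( \w+ (?: \( [^()]* \) )? ) )"

VALUES_RE = re.compile(
    rf"""
    ^\s* VALUES \s* \( \s*
    {FIELD} \s*,\s*
    {FIELD} \s*,\s*
    {FIELD}
    \s* \) \s* $
    """,
    re.VERBOSE,
)

def split_values(line: str):
    m = VALUES_RE.match(line)
    if m is None:
        return None
    g = m.groups()
    # Each field fills either its quoted group or its unquoted group
    return [g[i] if g[i] is not None else g[i + 1] for i in (0, 2, 4)]

line = "VALUES ('9gfdg', to_date('1876/12/06','YYYY/MM/DD'), null)"
print(" ".join(split_values(line)))
```

As with the original pattern, this handles only one level of parentheses and no parentheses inside quoted text.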