OpenRefine rearrange value - openrefine

In a csv column I have this data:
My Dog (101)
ACat(f023.12)
My Dog (101)
ACat ad
I would like to rearrange them like this:
101, My Dog ()
f023.12, ACat()
101, My Dog ()
To match them I could use a simple regex like (.* ?)\((.*)\) (the last row would be left untouched): https://regex101.com/r/ivrIa3/1
Is there an easier way of doing this than:
if(value.contains(/(.* ?)\((.*)\)/), value.match(/(.* ?)\((.*)\)/)[1] + ', ' + value.match(/(.* ?)\((.*)\)/)[0], value)

In OpenRefine, the easiest way would be to use a facet (like the « Text filter ») to select the rows that contain (…).
Then use the column command « Edit cells -> Replace » with a regular expression.
Find: (.*)\s*\((.*)\)
Replace: $2, $1()
Regards,
Antoine
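For readers who want to check the pattern outside OpenRefine, here is a minimal Python sketch of the same rearrangement. It uses the regex from the question; the function name rearrange is only illustrative, and rows without parentheses are returned unchanged.

```python
import re

# Pattern from the question: group 1 is the label (with any trailing space),
# group 2 is the text inside the parentheses.
PATTERN = re.compile(r"^(.* ?)\((.*)\)$")


def rearrange(value: str) -> str:
    """Move the parenthesised part to the front; leave other rows untouched."""
    match = PATTERN.match(value)
    if match is None:
        return value
    label, inner = match.groups()
    return f"{inner}, {label}()"


if __name__ == "__main__":
    for row in ["My Dog (101)", "ACat(f023.12)", "ACat ad"]:
        print(rearrange(row))
```

The original spacing of the label is preserved, which is why the first row keeps the space before the empty parentheses.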

Related

Parse string as JSON with Snowflake SQL

I have a field in a table of our db that works like an event-like payload, where all changes to different entities are gathered. See example below for a single field of the object:
'---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: this_utc\nanother_date: 2022-11-30\nutc: another_utc'
Since accessing this field with pure SQL is a pain, I was thinking of parsing it as a JSON so that it would look like this:
{
"field_one":"1",
"field_two": "20",
"field_three": "4",
"id": "1234",
"another_id": "5678",
"some_text": "Hey you",
"a_date": "2022-11-29",
"utc": "2022-11-29 15:29:28.159296000 Z",
"another_date": "2022-11-30",
"utc": "2022-11-30 13:34:59.000000000 Z"
}
And then just use a Snowflake-native approach to access the values I need.
As you can see, though, there are two fields that are called utc, since one is referring to the first date (a_date), and the second one is referring to the second date (another_date). I believe these are nested in the object, but it's difficult to assess with the format of the field.
This is a problem since I can't differentiate between one utc and another when giving the string the format I need and running a parse_json() function (due to both keys using the same name).
My SQL so far looks like the following:
select
object,
replace(object, '---\n', '{"') || '"}' as first,
replace(first, '\n', '","') as second_,
replace(second_, ': ', '":"') as third,
replace(third, ' ', '') as fourth,
replace(fourth, ' ', '') as last
from my_table
(Steps third and fourth are needed because I have some fields that have extra spaces in them)
And this actually gives me the format I need, but due to what I mentioned around the utc keys, I cannot parse the string as a JSON.
Also note that the structure of the string might change from row to row, meaning that some rows might gather two utc keys, while others might have one, and others even five.
Any ideas on how to overcome that?
Replace only one occurrence with regexp_replace():
with data as (
select '---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: this_utc\nanother_date: 2022-11-30\nutc: another_utc' o
)
select parse_json(last2)
from (
select o,
replace(o, '---\n', '{"') || '"}' as first,
replace(first, '\n', '","') as second_,
replace(second_, ': ', '":"') as third,
replace(third, ' ', '') as fourth,
replace(fourth, ' ', '') as last,
regexp_replace(last, '"utc"', '"utc2"', 1, 2) last2
from data
)
;
This may not be what you want, but it seems to me that the problem goes away if each UTC timestamp replaces the date that precedes it, because the date keys (a_date, another_date) are not duplicated. You can always derive the dates again from the timestamps. If that makes sense for your data, see if you can apply your parse_json solution to this output instead:
set str='---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: 2022-11-29 15:29:28.159296000 Z\nanother_date: 2022-11-30\nutc: 2022-11-30 13:34:59.000000000 Z';
select regexp_replace($str,'[0-9]{4}-[0-9]{2}-[0-9]{2}\nutc:')
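Because the number of repeated utc keys varies from row to row, renaming only the second occurrence may not be enough. The sketch below shows, in Python rather than Snowflake SQL, one way to make every repeated key unique by appending a counter. The suffix format (utc_2, utc_3, ...) and the function name payload_to_json are assumptions made for illustration, and the sketch assumes no original key already uses such a suffix.

```python
import json


def payload_to_json(payload: str) -> str:
    """Convert a 'key: value' payload into JSON, numbering repeated keys."""
    result = {}
    seen = {}  # how many times each key has appeared so far
    for line in payload.splitlines():
        if line.strip() in ("", "---"):
            continue  # skip the leading marker and blank lines
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        key = key.strip()
        seen[key] = seen.get(key, 0) + 1
        unique_key = key if seen[key] == 1 else f"{key}_{seen[key]}"
        result[unique_key] = value.strip()
    return json.dumps(result)


if __name__ == "__main__":
    sample = (
        "---\nfield_one: 1\na_date: 2022-11-29\nutc: this_utc\n"
        "another_date: 2022-11-30\nutc: another_utc"
    )
    print(payload_to_json(sample))
```

The same logic could live in a Snowflake UDF, but that packaging is left out here.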

Alternative workaround for lazy regular expressions in Snowflake, since this feature is not available there

I am trying to parse the "name" and "address" from a string. I have written the regex pattern which works perfectly fine (I verified in regex101.com) with the 'ungreedy/lazy' feature of regex but not with the greedy. Here is my snowflake query:
select
TRIM(REGEXP_SUBSTR(column1,'(^\\D*)((\\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\\d+.*)|(\\d+.*))[,.\\s]+([a-zA-Z]{2})[,.\\s]+(\\d{5}|\\d{5}-\\d{4})$',1,1,'is',1)) as test
from values(TRIM('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789'));
--please ignore the latter part of regex as I am fetching territory code and zip code also and they are working fine.
The above query returns "FIRST SECOND THIRD PO BOX".
And if I ask for the 2nd group, it returns "123 DUMMY".
What I want:
case 1 - when my string is 'FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789'
output of 1st group: "FIRST SECOND THIRD"
output of 2nd group: "PO BOX 123 DUMMY"
case 2 - WHEN my string is 'FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789'
output of 1st group: "FIRST SECOND THIRD FOURTH FIFTH"
output of 2nd group: "123 DUMMY"
Please suggest a workaround in Snowflake, since it doesn't support lazy quantifiers.
PS. If you want to verify in regex101, paste the below code and test string. You will see the result when you switch to Ungreedy.
(^\D*)((\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d+.*)|(\d+.*))[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$
Test String: FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789
Thanks
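Writing a JavaScript UDF is always an option, because JavaScript regular expressions support lazy quantifiers. The only change to your regex is making the first group lazy (\D*? instead of \D*):
create or replace function parse_address(F STRING)
returns VARIANT
language JAVASCRIPT
immutable
as $$
// Lazy \D*? ends the name group as early as possible, so "PO BOX" stays with the address.
const regex = /(^\D*?)((\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d+.*)|(\d+.*))[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$/gm;
const m = regex.exec(F);
// Return NULL for non-matching input instead of throwing; trim the trailing space from the name.
return m === null ? null : [m[1].trim(), m[2]];
$$;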
Usage:
select parse_address($1)
from values('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789')
, ('FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789')
;
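To see why a single lazy quantifier on the first group is enough, the two test strings from the question can be run through an equivalent pattern in Python. This is only a sketch for checking the regex logic, not Snowflake code; the function name split_name_address is made up for the example, and the inner groups are written as non-capturing for brevity.

```python
import re

# Same structure as the question's regex, with a lazy first group (\D*?).
ADDRESS_RE = re.compile(
    r"^(\D*?)"                                               # name (lazy)
    r"((?:\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d+.*)"     # PO BOX address
    r"|(?:\d+.*))"                                           # or street address
    r"[,.\s]+([a-zA-Z]{2})"                                  # territory code
    r"[,.\s]+(\d{5}-\d{4}|\d{5})$"                           # zip code
)


def split_name_address(text: str):
    """Return (name, address), or None when the text does not match."""
    match = ADDRESS_RE.match(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


if __name__ == "__main__":
    print(split_name_address("FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789"))
    print(split_name_address("FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789"))
```

With a greedy \D* the name group would swallow "PO BOX", which is the behaviour described in the question.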

I want to split the string and keep the first word only

I have a dataframe which contains details of cars. I want to keep only the brand name and remove the model name.
I've tried using the str.split function to separate the car name. However, it gives me a list, and I'm not able to extract the first element from it.
splitted = df['CarName'].str.split(' ',1)
Expected result:
alfa-romero
Audi
VW
Actual result:
[alfa-romero, giulia]
[alfa-romero, stelvio]
[alfa-romero, Quadrifoglio]
[audi, 100 ls]
[audi, 100ls]
You can do this in two ways: as WeNYoBen explained in his comment, or by using extract against a list of brands. (In pandas 2.0 and later, n must be passed as a keyword argument.)
df['brand'] = df['CarName'].str.split(' ', n=1).str[0]
or
pattern =['audi', 'alfa-romero']
df['brand_2'] = df['CarName'].str.extract("(" + "|".join(pattern) + ")", expand=False)
Then you can do
splitted = df['CarName'].str.split(' ', n=1).str[0]
This could also be achieved using pandas.Series.apply with str.split:
df['res'] = df['CarName'].apply(lambda x: str(x).split(' ')[0])

Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag

I am new to Pig Latin.
I want to extract all lines from log files that match a filter criterion (they contain the word "line_token"), and then extract two different fields from these matching lines, each meeting its own match criterion. Since the lines aren't well structured, I am loading them as a chararray.
When I try to run the following code, I get the error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried an explicit cast to a tuple, but that does not work.
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note that my logs aren't well structured, so I can't change the way I load them.
After loading and the initial filtering for lines of interest (which is straightforward), I guess I need to do something other than tokenizing the line and iterating through the tokens to find the fields.
Or maybe I should use joins?
Also, if I knew the structure of the line beforehand, with all fields as text, would loading it differently (not as a chararray) make this an easier problem?
For now I made a compromise: I added an extra filter clause to my original line filter and settled for picking just one field from the line. When I get back to it I will try joins and post that code. Here is my working code, which gets me a useful output but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1.*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up specific field from tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurrences of that field and store it with field name
-- My original intent is to store another field name as well
-- I will do that once I figure how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single value (tok_line) in each row, and you want to extract two words from each line. REGEX_EXTRACT will help you achieve this. I'm not a regex expert, so I will leave writing the regex part up to you.
data = FOREACH tokenized_lines
GENERATE
REGEX_EXTRACT(tok_line, <first word regex goes here>) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>) as secondWord;
I hope this helps.
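The idea of applying two independent regexes to each filtered line can be prototyped outside Pig before writing the REGEX_EXTRACT calls. The following Python sketch does that; the sample log lines, the patterns and the function name extract_fields are all hypothetical, chosen only to illustrate the flow.

```python
import re
from collections import Counter


def extract_fields(lines, line_token, first_pattern, second_pattern):
    """Keep lines containing line_token and pull two fields out of each one."""
    pairs = []
    for line in lines:
        if line_token not in line:
            continue  # same role as the FILTER ... MATCHES step
        first = re.search(first_pattern, line)
        second = re.search(second_pattern, line)
        # A field that is absent from the line is recorded as None.
        pairs.append((
            first.group(0) if first else None,
            second.group(0) if second else None,
        ))
    return pairs


if __name__ == "__main__":
    log = [
        "2024-01-01 line_token user=alice status=200",
        "unrelated noise",
        "2024-01-02 line_token user=bob",
    ]
    pairs = extract_fields(log, "line_token", r"user=\w+", r"status=\d+")
    print(pairs)
    # Count how often each (first, second) pair occurs, like GROUP + COUNT.
    print(Counter(pairs))
```

Keeping both fields in one tuple per line avoids the need to build a bag of two separately filtered relations.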
You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
What do you need the output to look like?

Regex capturing inside a group

I am working on a method that extracts all the values from a SQL query and then escapes them in PHP.
The idea is to guard against a programmer who is careless about security when writing a SQL query.
So when I try to execute this:
INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)
The regex needs to capture 'a', 'b', 'c', a and b.
I have been working on this for a couple of days.
This is as far as I could get, using two regexes, but I want to know if there is a better way:
VALUES ?\((([\w'"]+).+?)\)
Based on the previous SQL this will match:
VALUES ('a','b','c',a,b)
The second regex
['"]?(\w)['"]?
Will match
a b c a b
After first removing VALUES, of course.
This approach matches most of the values I am going to insert.
But it doesn't work with JSON, for example:
{a:b, "asd":"ads" ....}
Any help with this?
First, I think you should know that SQL supports many forms of single- and double-quoted strings:
'Northwind\'s category name'
'Northwind''s category name'
"Northwind \"category\" name"
"Northwind ""category"" name"
"Northwind category's name"
'Northwind "category" name'
'Northwind \\ category name'
'Northwind \ncategory \nname'
To match them, try these patterns:
"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"
'[^\\']*(?:(?:\\.|'')[^\\']*)*'
Combine the patterns:
VALUES\s*\(\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|'[^\\']*(?:(?:\\.|'')[^\\']*)*'|\w+)(?:\s*,\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|'[^\\']*(?:(?:\\.|'')[^\\']*)*'|\w+))*\)
PHP5.4.5 sample code:
<?php
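// In a single-quoted PHP string, \\\\ is needed to put the regex token \\ (a literal backslash) into the pattern.
$pat = '/\bVALUES\s*\((\s*(?:"[^\\\\"]*(?:(?:\\\\.|"")[^\\\\"]*)*"|\'[^\\\\\']*(?:(?:\\\\.|\'\')[^\\\\\']*)*\'|\w+)(?:\s*,\s*(?:"[^\\\\"]*(?:(?:\\\\.|"")[^\\\\"]*)*"|\'[^\\\\\']*(?:(?:\\\\.|\'\')[^\\\\\']*)*\'|\w+))*)\)/';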
$sql_sample1 = "INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)";
if( preg_match($pat, $sql_sample1, $matches) > 0){
printf("%s\n", $matches[0]);
printf("%s\n\n", $matches[1]);
}
$sql_sample2 = 'INSERT INTO tabla (a, b,c,d) VALUES (\'a\',\'{a:b, "asd":"ads"}\',\'c\',a,b)';
if( preg_match($pat, $sql_sample2, $matches) > 0){
printf("%s\n", $matches[0]);
printf("%s\n", $matches[1]);
}
?>
output:
VALUES ('a','b','c',a,b)
'a','b','c',a,b
VALUES ('a','{a:b, "asd":"ads"}','c',a,b)
'a','{a:b, "asd":"ads"}','c',a,b
If you need each individual value from the result, split it on commas while treating each quoted string as a single field (as when parsing CSV).
I hope this will help you :)
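A plain split on commas would break values that contain commas inside quotes, such as the JSON example. One way around that is to reuse the three alternatives (double-quoted string, single-quoted string, bare word) as a tokenizer. The sketch below shows this in Python rather than PHP; the function name extract_values is illustrative, and locating the VALUES list with a greedy match up to the last closing parenthesis is a simplifying assumption.

```python
import re

# One token is a double-quoted string, a single-quoted string, or a bare word.
VALUE_RE = re.compile(r"""
    "[^\\"]*(?:(?:\\.|"")[^\\"]*)*"     # double-quoted, \" or "" escapes
  | '[^\\']*(?:(?:\\.|'')[^\\']*)*'     # single-quoted, \' or '' escapes
  | \w+                                 # unquoted word or number
""", re.VERBOSE)


def extract_values(sql: str):
    """Return the individual values of the VALUES (...) list, quotes included."""
    match = re.search(r"\bVALUES\s*\((.*)\)", sql, re.DOTALL)
    if match is None:
        return []
    return VALUE_RE.findall(match.group(1))


if __name__ == "__main__":
    print(extract_values("INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)"))
    print(extract_values(
        "INSERT INTO tabla (a, b,c,d) VALUES ('a','{a:b, \"asd\":\"ads\"}','c',a,b)"
    ))
```

Each returned item still carries its surrounding quotes, so the caller can tell quoted literals from bare identifiers.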