Remove punctuation from string except apostrophe in Pig Latin script - apache-pig

I want to perform a word count on a text file and remove punctuation, except for apostrophes within words. I tried the code below, but it gives an error: unexpected " ".
word_file = LOAD '/user/username/text.txt' USING TextLoader AS(line:CHARARRAY);
stop_file = LOAD '/user/username/stop_words.txt' USING TextLoader AS(stop:CHARARRAY);
words = FOREACH word_file GENERATE FLATTEN(TOKENIZE(REPLACE(LOWER(TRIM(line)) ,'([\w\d'\s]+)', ''))) AS word;
Can anyone please help me on this?

http://pig.apache.org/docs/r0.17.0/func.html#replace
states
If you want to replace special characters such as '[' in the string
literal, it is necessary to escape them in 'regExp' by prefixing them
with double backslashes (e.g. '\\[').
So if you want to exclude any quoted string, I would do
word_file = LOAD 'input.txt' USING TextLoader AS(line:CHARARRAY);
words = FOREACH word_file GENERATE
FLATTEN(
TOKENIZE(
REPLACE(LOWER(TRIM(line)),'(\\\'[\\w\\d\\s]+\\\')', ''))) AS word;
STORE words into '...';
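For comparison, here is a minimal Python sketch of what the question asks for (drop all punctuation except apostrophes, then tokenize). It is not Pig code; the function name tokenize_keep_apostrophes is made up for illustration. The character class [^\w\s'] means "anything that is not a word character, whitespace or an apostrophe". In a Pig string literal the backslashes and the apostrophe would need the extra escaping described in the quoted documentation.

```python
import re

# Hypothetical helper, for illustration only.
def tokenize_keep_apostrophes(line):
    # Lowercase and trim, mirroring LOWER(TRIM(line)) in the Pig script.
    cleaned = line.strip().lower()
    # Remove every character that is not a word character,
    # whitespace or an apostrophe.
    cleaned = re.sub(r"[^\w\s']", "", cleaned)
    # Split on whitespace, roughly what TOKENIZE does for plain words.
    return cleaned.split()

print(tokenize_keep_apostrophes("  Don't stop, believing! "))
```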

Related

How to read TSV file without text delimiters where text contains single and double quotes

I have an input text file where the fields are tab separated. Some fields contain text with single quotes ('), some contain text with double quotes ("), and some contain both. Here is an example:
Theme from Bram Stoker's Dracula (From Bram Stoker's Dracula"") Soundtrack & Theme Orchestra
Is there any way to tell OPENROWSET to not try to parse the fields?
I have found that I can set FIELDQUOTE to either a single quote or a double quote, but not to both (using FIELDQUOTE = '''"' gives the error Multi-byte field quote is not supported).
Here's an example of a query I try to use:
SELECT TOP 10 *
FROM OPENROWSET
(
BULK 'files/*.txt',
DATA_SOURCE = 'files',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
FIELDTERMINATOR = '\t',
FIELDQUOTE = ''''
)
AS r
and I can also use FIELDQUOTE = '"' but not the two at the same time...
Any suggestions on how to fix this? (without changing the source files)

Delimiter after a quoted field, how to escape quote

I have a file like this:
info1;info2;info3";info4;info5
After parsing it I get this error:
Error: [42636] ETL-2106: Error while parsing row=0 (starting from 0) [CSV Parser found at byte 5 (starting with 0 at the beginning of the row) of 5 a field delimiter after an quoted field (with an additional whitespace) in file '~/path'. Please check for correct enclosed fields, valid field separators and e.g. unescaped field delimiters that are contained in the data (these have to be escaped)]
I'm sure the cause is the info3"; part, but I have no idea how to solve it.
I also can't get rid of the quotes, because they have to appear in the report.
The main part of the Python code is:
# Transform data to valid CSV format: remove BOM, remove '=' sign, remove repeating quotes in Size column
decoded_csv = r.content.decode('utf-8').replace(u'\ufeff', '').replace('=', '')
print(decoded_csv)
cr = csv.reader(decoded_csv.splitlines(), delimiter=';')
lst = list(cr)[1:]
f = csv.writer(open(base_folder + 'txt/' + shop, "w+"), delimiter=';')
for row in lst:
f.writerow(row[:-2])
After running this code I get a file like this:
info1;info2;"info3""";info4;info5
which is not what I need.
But when I change the code a little by adding quoting=csv.QUOTE_NONE, quotechar='' to the writer:
# Transform data to valid CSV format: remove BOM, remove '=' sign, remove repeating quotes in Size column
decoded_csv = r.content.decode('utf-8').replace(u'\ufeff', '').replace('=', '')
print(decoded_csv)
cr = csv.reader(decoded_csv.splitlines(), delimiter=';')
lst = list(cr)[1:]
f = csv.writer(open(base_folder + 'txt/' + shop, "w+"), delimiter=';', quoting=csv.QUOTE_NONE, quotechar='')
for row in lst:
f.writerow(row[:-2])
I get what I need
info1;info2;info3";info4;info5
The second step is the Exasol import, and this is the code that returned the error:
MERGE INTO hst AS dst
USING (
SELECT DISTINCT
ar,
ar_na
FROM (
IMPORT INTO
(
ar VARCHAR(100) UTF8 COMMENT IS 'ar',
ar_na VARCHAR(100) UTF8 COMMENT IS 'ar na'
)
FROM CSV /*SS:R*/
AT '&1'
USER '&2'
IDENTIFIED BY '&3'
FILE '~/path'
SKIP = 0
ROW SEPARATOR = 'CRLF'
COLUMN SEPARATOR = ';'
TRIM
)
GROUP BY
ar,
ar_na
) src ON src.ar = dst.ar
WHEN MATCHED THEN UPDATE SET
dst.ar_na = src.ar_na
WHEN NOT MATCHED THEN
INSERT (
ar,
ar_na
)
VALUES (
src.ar,
src.ar_na
);
If the file looks like info1;info2;info3;info4;info5, everything works fine and all scripts run.
By default, Exasol treats double quotes (") as the column delimiter. This lets you specify values that contain the column separator (in your case, the semicolon). See the entry "Special characters" in the documentation.
You have two options here:
Disable the column delimiter by passing COLUMN DELIMITER = '' to the import statement.
Duplicate all double quotes in the csv file. Exasol ignores the column delimiter if it occurs twice consecutively.
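If you go with the second option, the doubling could be done in the Python step before the file is written. A minimal sketch, assuming the data is handled as plain text lines; the helper name double_quotes is made up for illustration:

```python
# Hypothetical helper: double every double quote so that, per the
# rule described above, the importer no longer treats it as a
# column delimiter.
def double_quotes(line: str) -> str:
    return line.replace('"', '""')

row = 'info1;info2;info3";info4;info5'
print(double_quotes(row))
```

Note that the file on disk then contains info3"" rather than info3", so this only fits if the file is used purely as import input.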

find all occurrences of a regex as an array

I have the following string (it's a Salesforce query, but that's not important):
IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),
CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),
"Free Mail","Business Email")
and I want to get an array of all substrings that are encapsulated between double quotes like so:
['yahoo',
'gmail',
'protonmail.com,att.net,chpmail.com,smail.com',
'Free Mail',
'Business Email']
In Python I do:
re.findall(r'"(.+?)"', <my string>)
but is there a way to replicate this in Snowflake?
I've tried
SELECT
REGEXP_SUBSTR('IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),
CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),
"Free Mail","Business Email")', '"(.+?)"') as emails;
but I get this:
"yahoo"),CONTAINS(EmailDomain,"gmail"
You can use
select split(trim(regexp_replace(regexp_replace(col, '"([^"]+)"|.', '\\1|'),'\\|+','|'), '|'), '|');
Details:
regexp_replace(col, '"([^"]+)"|.', '\\1|') - finds any strings between the closest double quotes while capturing the part inside quotes into Group 1, or matching any single char and replaces each match with Group 1 contents + | char (see the regex demo)
regexp_replace(...,'\\|+','|') - this shrinks all consecutive pipe symbols into a single occurrence of a | char (see this regex demo)
trim(..., '|') - removes | chars on both ends of the string
split(..., '|') - splits the string with a | char.
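To make the four steps easier to follow, here is the same pipeline as a minimal Python sketch. It only illustrates the technique and is not Snowflake code; the function name quoted_substrings is made up. It relies on Python 3.5+ substituting an empty string for a group that did not take part in the match.

```python
import re

# Hypothetical helper mirroring the replace / collapse / trim / split steps.
def quoted_substrings(text):
    # Step 1: a quoted string becomes its content plus '|';
    # any other single character becomes just '|'.
    marked = re.sub(r'"([^"]+)"|.', r'\1|', text, flags=re.DOTALL)
    # Step 2: collapse runs of '|' into a single '|'.
    collapsed = re.sub(r'\|+', '|', marked)
    # Steps 3 and 4: trim '|' at both ends, then split on '|'.
    return collapsed.strip('|').split('|')

formula = ('IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail")),'
           '"Free Mail","Business Email")')
print(quoted_substrings(formula))
```

As in the SQL version, a quoted string that itself contains | would be split into several elements.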
Wiktor's answer works great. I'm adding an alternate answer for anyone who needs to do this and their quoted strings may contain the pipe | character. Using the replacement method on strings containing pipe(s) will split the string into more than one array member. Here's a way (not the only way) to do it that will work in case the quoted strings could potentially contain pipe characters:
set col = $$IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),"Free Mail","Business Email | Other")$$;
create or replace function GET_QUOTED_STRINGS("s" string)
returns array
language javascript
strict immutable
as
$$
var re = /(["'])(?:\\.|[^\\])*?\1/g;
var m;
var out = [];
do {
m = re.exec(s);
if (m) {
out.push(m[0].replace(/['"]+/g, ''));
}
} while (m);
return out;
$$;
select get_quoted_strings($col);

How do I extract the "pattern" word when using Like operator?

Hi, I'm trying to figure out a way to retrieve the matched word from the Like operator.
Ex:
text = "jsoihj a125847 asf"
Dim s as String = text Like "*a######*"
I would like 's' to equal the actual word that has the pattern of " * a###### * " instead of it returning True
As stated in the comments above, Like will not give you the string. You could parse the string character by character to find the pattern, but this is a natural job for Regex. I am no expert here, so I use sites like RegExr to hack out and test the match string.
' Requires: Imports System.Text.RegularExpressions
Dim s As String = Regex.Match(text, "a[0-9]{6}").Value
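The same idea as a minimal Python sketch, assuming the Like pattern a###### means a lowercase a followed by six digits. The function name find_code is made up for illustration.

```python
import re

# Hypothetical helper: return the first substring that looks like
# 'a' followed by six digits, or None when nothing matches.
def find_code(text):
    match = re.search(r"a[0-9]{6}", text)
    return match.group(0) if match else None

print(find_code("jsoihj a125847 asf"))
```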

Search an Oracle clob for special characters that are not escaped

Is it possible to run a query that searches an Oracle CLOB for any record containing an ampersand that is not part of one of the following escape codes (or possibly any escape code):
& - &amp;
< - &lt;
> - &gt;
" - &quot;
' - &apos;
I want to extract the 5 characters before the ampersand and the 5 characters after it so I can see the actual value.
Basically, I want to find every record that contains those unescaped characters and replace them with the escape codes.
At the moment I am doing something like this:
Select * from articles
where dbms_lob.instr(article_summary , '&amp' ) = 0 and dbms_lob.instr(article_summary , '&' ) > 0
Update
If I were to use a regular expression, how would I write it to retrieve all records where the value contains & followed by any character other than 'a'?
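For what it's worth, the kind of pattern being asked about can be sketched with a negative lookahead. Below is a minimal Python illustration (not Oracle SQL), assuming the only escape codes of interest are &amp;, &lt;, &gt;, &quot; and &apos;. The names BARE_AMP and bare_ampersand_contexts are made up for this example.

```python
import re

# An ampersand that is NOT the start of one of the five escape codes.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos);)")

# Hypothetical helper: for each bare ampersand, return it together
# with up to `width` characters on either side.
def bare_ampersand_contexts(text, width=5):
    return [
        text[max(0, m.start() - width): m.start() + 1 + width]
        for m in BARE_AMP.finditer(text)
    ]

print(bare_ampersand_contexts("Fish & chips, Tom &amp; Jerry"))
```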
You can use DBMS_XMLGEN.CONVERT for this. The second parameter is optional and, if left out, the function escapes the XML special characters.
select DBMS_XMLGEN.CONVERT(article_summary)
from articles;
But if article_summary contains a mixture of escaped and unescaped characters, this will give the wrong result, because the already escaped ones get escaped a second time. The easiest way to solve that is to unescape the characters first and then escape them again.
select DBMS_XMLGEN.CONVERT(
DBMS_XMLGEN.CONVERT(article_summary,1) --1 as parameter does unescaping
)
from articles;
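The unescape-then-escape idea is not specific to Oracle. Here is a minimal Python sketch of the same round trip using xml.sax.saxutils from the standard library. It is an analogy only, not a reimplementation of DBMS_XMLGEN.CONVERT; the function name normalize_escapes and the two quote mappings are assumptions made for this example.

```python
from xml.sax.saxutils import escape, unescape

# escape()/unescape() handle &, < and > on their own;
# the quote entities have to be supplied explicitly.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARS = {entity: char for char, entity in QUOTE_ENTITIES.items()}

# Hypothetical helper: first turn every known escape code back into
# its character, then escape everything exactly once.
def normalize_escapes(text):
    plain = unescape(text, QUOTE_CHARS)
    return escape(plain, QUOTE_ENTITIES)

print(normalize_escapes("Tom &amp; Jerry & friends <b>"))
```

Because everything is unescaped first, text that was already escaped does not end up as &amp;amp;.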