EqualsIgnoreCase function - Exception : org.apache.pig.backend.executionengine.ExecException
Input :
a.csv
-------
a
A
(blank/empty line)
b
B
c
C
Objective : To select the records which are 'a', 'A', 'b' and 'B'.
Approach 1 :
A = LOAD 'a.csv' using PigStorage(',') AS (value:chararray);
B = FILTER A BY LOWER(value) IN ('a','b');
DUMP B;
Output :
(a)
(A)
(b)
(B)
Approach 2 :
C = FILTER A BY EqualsIgnoreCase(value, 'a') or EqualsIgnoreCase(value, 'b');
Output :
2015-04-27 23:48:21,958 [Thread-30] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0014
org.apache.pig.backend.executionengine.ExecException
at org.apache.pig.builtin.EqualsIgnoreCase.exec(EqualsIgnoreCase.java:50)
I am trying to understand why this exception is thrown. I understand that it is because of the blank record.
I tried checking that value is not null or empty, but I still get the same error.
D = FILTER A BY (value IS NOT NULL) OR (TRIM(value) != '') AND (EqualsIgnoreCase(value, 'a') or EqualsIgnoreCase(value, 'b'));
Any input or thoughts on achieving our objective using Approach 2 would be much appreciated.
Yes, you are right: the string functions EqualsIgnoreCase and TRIM cannot handle the blank record in the input.
To solve this, your last statement is almost right: remove the TRIM check and combine the null check with AND instead of OR, and it will work.
C = FILTER A BY (value is not null) and (EqualsIgnoreCase(value, 'a') or EqualsIgnoreCase(value, 'b'));
The IS NOT NULL condition takes care of the empty record (it is loaded as null), so the TRIM function is not required.
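To see why the guard has to be joined with AND, the filter can be modeled outside Pig. This Python sketch is only an analogy, not Pig code: `equals_ignore_case` is a hypothetical stand-in that fails on a null (None) argument, as the stack trace suggests the built-in does, and the blank line is assumed to arrive as None.

```python
def equals_ignore_case(value, other):
    # Hypothetical stand-in for Pig's EqualsIgnoreCase: fails on null input.
    if value is None:
        raise ValueError("null input")
    return value.lower() == other.lower()


# None models the blank line in a.csv (assumption: it is loaded as null).
records = ["a", "A", None, "b", "B", "c", "C"]


def keep(value):
    # The null check comes first; `and` short-circuits, so a None value
    # never reaches equals_ignore_case.
    return value is not None and (
        equals_ignore_case(value, "a") or equals_ignore_case(value, "b")
    )


selected = [r for r in records if keep(r)]
print(selected)
```

If the first `and` is replaced with `or`, as in the original attempt, the None record fails the null check and is then passed on to the string functions, which is where the exception comes from.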
I'm trying to implement the behavior of selecting data based on either an array of input, or get all data if array is null or empty.
SELECT * FROM table_name
WHERE
('{}' = $1 OR col = ANY($1))
This returns the error: pq: op ANY/ALL (array) requires array on right side.
If I run
SELECT * FROM table_name
WHERE
(col = ANY($1))
This works just fine and I get the contents I expected.
I can also use array_length, but it requires me to state the type of the data in $1. If I do (array_length($1::string[],1) < 1 OR col = ANY($1)), the array_length check seems to always return false and evaluation goes on to col = ANY($1).
How can I return either JUST the values from $1 OR all if $1 is '{}' or NULL?
Got it:
($1::string[] IS NULL OR event_id = ANY($1))
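As a rough model of what that predicate selects, here is a small Python sketch. The function name and the sample ids are made up for illustration, and SQL's three-valued logic for NULL column values is ignored.

```python
def matches(event_id, param):
    # Models: $1 IS NULL OR event_id = ANY($1)
    if param is None:
        return True           # NULL parameter: every row passes
    return event_id in param  # otherwise the id must be in the array


rows = ["e1", "e2", "e3"]  # made-up ids

all_rows = [r for r in rows if matches(r, None)]
some_rows = [r for r in rows if matches(r, ["e2"])]
no_rows = [r for r in rows if matches(r, [])]

print(all_rows, some_rows, no_rows)
```

Note that in this model an empty array selects nothing, because no id is a member of it. If '{}' should also mean "all rows", the caller would have to pass NULL instead, or the predicate would need an extra check for the empty array.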
I have some big CSV files that contain long strings. I want to parse them in U-SQL.
@t1 =
SELECT
Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)") AS p
FROM
(VALUES(1)) AS fe(n);
@t2 =
SELECT
p.Groups["ID"].Value AS gads_id,
p.Groups["T"].Value AS gads_t,
p.Groups["S"].Value AS gads_s
FROM
@t1;
OUTPUT @t2
TO "/inhabit/test.csv"
USING Outputters.Csv();
Severity Code Description Project File Line Suppression State
Error E_CSC_USER_INVALIDCOLUMNTYPE:
'System.Text.RegularExpressions.Match' cannot be used as column type.
I know how to do it the SQL way with EXPLODE/CROSS APPLY/GROUP BY. But maybe it is possible to do it without these contortions?
One more update:
@t1 =
SELECT
Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["ID"].Value AS id,
Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["T"].Value AS t,
Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["S"].Value AS s
FROM
(VALUES(1)) AS fe(n);
OUTPUT @t1
TO "/inhabit/test.csv"
USING Outputters.Csv();
This variant works fine, but it raises a question: will the regex be evaluated 3 times per row? Is there any way to hint to the U-SQL engine that the function Regex.Match is deterministic?
You should probably be using something more efficient than Regex.Match. But to answer your original question:
System.Text.RegularExpressions.Match is not part of the built-in U-SQL types.
Thus you would need to convert it into a built-in type, such as string or SqlArray<string>, or wrap it in a UDT that provides an IFormatter to make it a user-defined type.
It looks like it is better to parse such simple strings with something like the following. Regexes are slow for this task, and if I use simple string expressions (instead of CLR calls), they will probably be translated into C++ code at the codegen phase, which would eliminate the .NET interop (I'm not sure).
@t1 =
SELECT
pv.cust_gads != null ? new SQL.ARRAY<string>(pv.cust_gads.Split(':')) : null AS p
FROM
dwh.raw_page_view_data AS pv
WHERE
pv.year == "2017" AND
pv.month == "04";
@t3 =
SELECT
p != null && p.Count == 3 ? p[0].Split('=')[1] : null AS id,
p != null && p.Count == 3 ? p[1].Split('=')[1] : null AS t,
p != null && p.Count == 3 ? p[2].Split('=')[1] : null AS s
FROM
@t1 AS t1;
OUTPUT @t3
TO "/tmp/test.csv"
USING Outputters.Csv();
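For comparison, here are the two parsing strategies from this thread side by side as a Python sketch: one regex match whose groups are read three times, and the split-based variant. The function names are mine, and this says nothing about how U-SQL compiles either form.

```python
import re

# Same shape as the U-SQL pattern; [\w\d_]* is equivalent to \w*.
PATTERN = re.compile(r"ID=(?P<ID>\w+):T=(?P<T>\w+):S=(?P<S>\w*)")


def parse_with_regex(text):
    # Match once, then read all three groups from the same match object.
    m = PATTERN.match(text) if text is not None else None
    if m is None:
        return (None, None, None)
    return (m.group("ID"), m.group("T"), m.group("S"))


def parse_with_split(text):
    # Split on ':' and require exactly three parts, like p.Count == 3.
    parts = text.split(":") if text is not None else []
    if len(parts) != 3:
        return (None, None, None)
    # Like the U-SQL above, this assumes every part contains '='.
    return tuple(p.split("=")[1] for p in parts)


sample = "ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q"
print(parse_with_regex(sample))
print(parse_with_split(sample))
```

Both functions return the same triple for the sample string and a triple of None values for null input.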
I have data in a column 'fruits' like this:
apple/green/
apple/red/
apple/brown
What I need to do is remove the '/' character at the end of rows 1 and 2. The 3rd row needs no change. My output should be:
apple/green
apple/red
apple/brown
I have tried doing this..
b = foreach a generate (fruits), ENDSWITH(fruits,'/')==true ? REPLACE(SUBSTRING(fruits, (INT)LAST_INDEX_OF(fruits, '/'), (INT)SIZE(fruits)),'');
Basically, I am trying to replace the trailing '/' symbol with an empty string.
But I am getting an error with this command. Can anyone please help?
The bincond operator has this syntax:
(condition ? value_if_true : value_if_false)
The else part is therefore mandatory. REPLACE also takes three arguments (string, regex, replacement), so the true branch is simpler as a SUBSTRING that drops the last character:
b = foreach a generate fruits, (ENDSWITH(fruits,'/') ? SUBSTRING(fruits, 0, (int)SIZE(fruits) - 1) : fruits);
Or, more simply, use the REPLACE function with a regex anchored at the end of the string:
b = foreach a generate REPLACE(fruits,'[/]$','');
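The anchored-regex idea can be checked quickly outside Pig. This is a Python sketch using the three sample rows from the question; `re.sub` stands in for REPLACE here.

```python
import re

fruits = ["apple/green/", "apple/red/", "apple/brown"]

# '/$' matches a slash only at the end of the string, so inner slashes
# and rows without a trailing slash are left untouched.
cleaned = [re.sub(r"/$", "", f) for f in fruits]
print(cleaned)
```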
I have 2 data sets on which I am trying to find the difference. I am aware that there are other ways to do the same. What I am interested in is why this snippet of code is failing.
A = LOAD 'raw.people1' using org.apache.hive.hcatalog.pig.HCatLoader();
B = LOAD 'raw.people2' using org.apache.hive.hcatalog.pig.HCatLoader();
C = COGROUP A BY (name, place, animal, thing) , B BY (name, place, animal, thing) ;
D = FOREACH C DIFF(A, B);
A, B and C work correctly. But D fails with the error:
Failed to parse: Syntax error, unexpected symbol at or near 'DIFF'
Now this should not be the case. The Pig docs (http://pig.apache.org/docs/r0.9.1/func.html#diff) state that DIFF takes two bags as parameters, and A and B are bags of tuples.
What am I missing here?
Thanks
You missed the GENERATE keyword before the DIFF call; that is the reason for this error. Can you change it like this?
D = FOREACH C GENERATE DIFF(A, B);
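As a rough picture of what the corrected statement is meant to compute, here is a Python sketch that treats the two data sets as sets of tuples and keeps the tuples found in only one of them. The rows are made up, the function name is mine, and Pig evaluates DIFF per COGROUP group and may treat duplicates differently, so take this only as an approximation.

```python
def diff(bag_a, bag_b):
    # Tuples that appear in exactly one of the two bags.
    return sorted(set(bag_a) ^ set(bag_b))


# Made-up rows with the columns (name, place, animal, thing).
people1 = [("ann", "paris", "cat", "pen"), ("bob", "rome", "dog", "cup")]
people2 = [("ann", "paris", "cat", "pen"), ("cid", "oslo", "emu", "map")]

difference = diff(people1, people2)
print(difference)
```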
I am working on a method to extract all the values from a SQL query and then escape them in PHP.
The idea is to cover for a programmer who is careless about security when writing a SQL query.
So when I try to execute this:
INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)
The regex needs to capture 'a', 'b', 'c', a and b.
I have been working on this for a couple of days.
This is as far as I could get, using 2 regexes, but I want to know if there is a better way to do it:
VALUES ?\((([\w'"]+).+?)\)
Based on the previous SQL this will match:
VALUES ('a','b','c',a,b)
The second regex
['"]?(\w)['"]?
Will match
a b c a b
This is after removing VALUES first, of course.
This way it will match a lot of the values I am going to insert.
But it doesn't work with JSON, for example:
{a:b, "asd":"ads" ....}
Any help with this?
First, you should know that SQL supports many forms of single- and double-quoted strings:
'Northwind\'s category name'
'Northwind''s category name'
"Northwind \"category\" name"
"Northwind ""category"" name"
"Northwind category's name"
'Northwind "category" name'
'Northwind \\ category name'
'Northwind \ncategory \nname'
To match them, try these patterns:
"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"
'[^\\']*(?:(?:\\.|'')[^\\']*)*'
Combine the patterns:
VALUES\s*\(\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|'[^\\']*(?:(?:\\.|'')[^\\']*)*'|\w+)(?:\s*,\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|'[^\\']*(?:(?:\\.|'')[^\\']*)*'|\w+))*\)
PHP 5.4.5 sample code:
<?php
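// In a single-quoted PHP string, four backslashes produce the two that the regex needs to match one literal backslash.
$pat = '/\bVALUES\s*\((\s*(?:"[^\\\\"]*(?:(?:\\\\.|"")[^\\\\"]*)*"|\'[^\\\\\']*(?:(?:\\\\.|\'\')[^\\\\\']*)*\'|\w+)(?:\s*,\s*(?:"[^\\\\"]*(?:(?:\\\\.|"")[^\\\\"]*)*"|\'[^\\\\\']*(?:(?:\\\\.|\'\')[^\\\\\']*)*\'|\w+))*)\)/';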
$sql_sample1 = "INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)";
if( preg_match($pat, $sql_sample1, $matches) > 0){
printf("%s\n", $matches[0]);
printf("%s\n\n", $matches[1]);
}
$sql_sample2 = 'INSERT INTO tabla (a, b,c,d) VALUES (\'a\',\'{a:b, "asd":"ads"}\',\'c\',a,b)';
if( preg_match($pat, $sql_sample2, $matches) > 0){
printf("%s\n", $matches[0]);
printf("%s\n", $matches[1]);
}
?>
Output:
VALUES ('a','b','c',a,b)
'a','b','c',a,b
VALUES ('a','{a:b, "asd":"ads"}','c',a,b)
'a','{a:b, "asd":"ads"}','c',a,b
If you need each individual value from the result, split the captured list on the commas that sit outside quotes (like parsing CSV).
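One way to do that split without breaking on commas inside quoted strings is to reuse the per-value pattern and collect every match. This is a Python sketch rather than PHP, and the function name is mine; in PHP the same idea would use preg_match_all.

```python
import re

# One value: a double-quoted string, a single-quoted string, or a bare word.
# Backslash escapes and doubled quotes are allowed inside the quoted forms.
VALUE = re.compile(
    r'''"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"'''
    r"""|'[^\\']*(?:(?:\\.|'')[^\\']*)*'"""
    r"|\w+"
)


def split_values(captured):
    # findall returns each whole match because the pattern has no capture groups.
    return VALUE.findall(captured)


values = split_values("""'a','{a:b, "asd":"ads"}','c',a,b""")
print(values)
```

The comma inside the JSON-like string stays part of the second value, because the quoted alternative consumes everything up to the closing quote.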
I hope this will help you :)