Pig function to read characters after a separator - apache-pig

This is my input file
a1,hello.VDF
a2,rim.VIM
a3.dr.VDD
I need output as below
a1,VDF
a2,VIM
a3,VDD
My script is the following:
myinput = LOAD 'file' USING PigStorage(',') AS (t1:chararray, t2:chararray);
foreached = FOREACH myinput GENERATE t1, SUBSTRING(t2, INDEXOF(t2,'.',1), SIZE(t2));
It throws an error. Please help.

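Try this. It takes the text after the last '.', and it also handles rows such as a3.dr.VDD, where there is no comma and the whole line lands in t1 (note that output is a reserved word in Pig, so the alias is named result):
result = FOREACH myinput GENERATE
    ((t1 matches '.*\\..*') ? SUBSTRING(t1, 0, INDEXOF(t1, '.', 0)) : t1),
    ((t1 matches '.*\\..*') ? SUBSTRING(t1, LAST_INDEX_OF(t1, '.') + 1, (int)SIZE(t1)) : SUBSTRING(t2, LAST_INDEX_OF(t2, '.') + 1, (int)SIZE(t2)));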

SIZE returns a long, but SUBSTRING takes int arguments, so you need to cast. Adding 1 to the index also skips the '.' itself:
foreached = FOREACH myinput GENERATE t1, SUBSTRING(t2, INDEXOF(t2,'.',1)+1, (int)SIZE(t2));
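Pig runs on the JVM, so the same index arithmetic can be checked in plain Java. This is only a minimal sketch, assuming the wanted part is everything after the last '.'; the class and method names (AfterSeparator, afterLast) are made up for illustration and are not part of Pig:

```java
class AfterSeparator {
    // Returns the text after the last occurrence of sep,
    // or the whole string when sep does not occur at all.
    static String afterLast(String s, char sep) {
        int idx = s.lastIndexOf(sep);   // -1 when sep is not found
        return s.substring(idx + 1);    // +1 skips the separator itself
    }

    public static void main(String[] args) {
        // Sample second-column values taken from the question's input.
        for (String field : new String[] {"hello.VDF", "rim.VIM", "dr.VDD"}) {
            System.out.println(afterLast(field, '.'));
        }
    }
}
```

Without the + 1 the result would still start with the '.', which is the off-by-one in the original script.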

BigQuery UDF to remove accents/diacritics in a string

Using this javascript code we can remove accents/diacritics in a string.
var originalText = "éàçèñ"
var result = originalText.normalize('NFD').replace(/[\u0300-\u036f]/g, "")
console.log(result) // eacen
If we create a BigQuery UDF from the same code, it does not work (even with a double \).
CREATE OR REPLACE FUNCTION project.remove_accent(x STRING)
RETURNS STRING
LANGUAGE js AS """
return x.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
""";
SELECT project.remove_accent("éàçèñ"); -- returns "éàçèñ" unchanged
Any thoughts on that?
Consider the approach below:
select originalText,
  regexp_replace(normalize(originalText, NFD), r"\pM", '') output
from (select "éàçèñ" as originalText)  -- sample value from the question
If applied to the sample data in your question, the output is eacen.
You can easily wrap it in a SQL UDF if you wish.
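For comparison, the same two steps (decompose to NFD, then drop every combining mark) can be sketched in Java with the standard java.text.Normalizer class. The class and method names (RemoveAccents, removeAccents) are illustrative only, and this is a sketch of the technique rather than anything BigQuery runs:

```java
import java.text.Normalizer;

class RemoveAccents {
    // Decomposes accented letters into base letter + combining mark (NFD),
    // then removes all characters in the Unicode "Mark" category.
    static String removeAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "éàçèñ" written with Unicode escapes to avoid source-encoding issues.
        System.out.println(removeAccents("\u00e9\u00e0\u00e7\u00e8\u00f1")); // prints eacen
    }
}
```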

Identifying columns through Pig

I have data set like below :
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
Which separator should I use in this case to split the data into the 3 columns above?
First column value is => Column,1A
Second column value is => Column2A
Third column value is => Column3A
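Try this code. PigStorage(',') splits the quoted first column into two fields, so the quotes are stripped from both pieces and the pieces are joined back with a comma:
a = LOAD '/home/hduser/pig_ex' USING PigStorage(',') AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray);
b = FOREACH a GENERATE REGEX_EXTRACT(col1, '^\\"(.*)', 1) AS modfirstcol, REGEX_EXTRACT(col2, '^(.*)\\"', 1) AS modsecondcol, col3, col4;
c = FOREACH b GENERATE CONCAT($0, CONCAT(',', $1)), $2, $3;
DUMP c;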
I am able to resolve it using the below steps:
Input:-
"column,1A",column2A,column3A
"column,1B",column2B,column3B
"column,1C",column2C,column3C
"column,1D",column2D,column3D
Pig Script :-
A = LOAD '/home/hduser/pig_ex' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\,',4)) AS (firstcol:chararray,secondcol:chararray,thirdcol:chararray,fourthcol:chararray);
C = FOREACH B GENERATE REGEX_EXTRACT(firstcol, '^\\"(.*)', 1) AS modfirstcol, REGEX_EXTRACT(secondcol, '^(.*)\\"', 1) AS modsecondcol, thirdcol, fourthcol;
D = FOREACH C GENERATE CONCAT(modfirstcol,',',modsecondcol), thirdcol, fourthcol;
DUMP D;
Output :-
(column,1A,column2A,column3A)
(column,1B,column2B,column3B)
(column,1C,column2C,column3C)
(column,1D,column2D,column3D)
Please let me know if there is a better way.

Regex to extract first part of string in Apache Pig

I need to extract post code district from the input data below
AB55 4
DD7 6LL
DD5 2HI
My Code
A = load 'data' as postcode:chararray;
B = foreach A {
code_district = REGEX_EXTRACT(postcode,'<SOME EXP>',1);
generate code_district;
};
dump B;
Output should look like
AB55
DD7
DD5
What regular expression should I use to extract the first part of the string?
Can you try one of the regexes below?
Option1:
A = LOAD 'input' as postcode:chararray;
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode,'(\\w+).*',1);
DUMP code_district;
Option2:
A = LOAD 'input' as postcode:chararray;
code_district = FOREACH A GENERATE REGEX_EXTRACT(postcode,'([a-zA-Z0-9]+).*',1);
DUMP code_district;
Output:
(AB55)
(DD7)
(DD5)
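Pig's REGEX_EXTRACT uses Java regular expressions, so the pattern from Option1 can be tried out in plain Java before running the job. This is just a sketch; the class and method names (PostcodeDistrict, district) are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PostcodeDistrict {
    // Group 1 captures the leading run of word characters (letters, digits, underscore).
    private static final Pattern LEADING_WORD = Pattern.compile("(\\w+).*");

    // Returns the first capture group, or null when nothing matches.
    static String district(String postcode) {
        Matcher m = LEADING_WORD.matcher(postcode);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample rows from the question.
        for (String p : new String[] {"AB55 4", "DD7 6LL", "DD5 2HI"}) {
            System.out.println(district(p));
        }
    }
}
```

The capture stops at the space because a space is not a word character.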

Using Aggregate functions in Pig

My input file is below
a1,1,on,400
a1,2,off,100
a1,3,on,200
I need to sum $3 only where $2 equals "on"; $1 should be summed with no filter at all. I have written the script below, but I don't know how to proceed from there. Can someone help me finish it?
myinput = LOAD 'file' USING PigStorage(',') AS (id:chararray, num:int, flag:chararray, amt:int);
grouped = GROUP myinput BY id;
I need the output below:
a1,6,600
Here is a possible solution. You could do something like this (not tested):
myinput = LOAD 'file' USING PigStorage(',') AS (id:chararray, num:int, flag:chararray, amt:int);
A = FOREACH myinput GENERATE id, num AS first_sum, ((flag == 'on') ? amt : 0) AS second_sum;
grouped = GROUP A BY id;
RESULT = FOREACH grouped GENERATE group AS id, SUM(A.first_sum), SUM(A.second_sum);
That should do the trick
Try this. The nested FILTER keeps only the "on" rows for the second sum:
myinput = LOAD '/home/gopalkrishna/PIGPRAC/pig-sum.txt' USING PigStorage(',') AS (name:chararray, num:int, stat:chararray, amt:int);
A = GROUP myinput BY name;
B = FOREACH A {
    on_rows = FILTER myinput BY stat == 'on';
    GENERATE group, SUM(myinput.num), SUM(on_rows.amt);
};
STORE B INTO 'SUMOUT';
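To sanity-check the expected numbers outside Pig, the aggregation the question asks for (sum the second column always, sum the fourth column only when the third is "on") can be sketched in plain Java. The class and method names (ConditionalSum, aggregate) are hypothetical, and the input is assumed to be well-formed comma-separated lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConditionalSum {
    // Returns id -> {sum of column 2, sum of column 4 where column 3 is "on"}.
    static Map<String, long[]> aggregate(String[] lines) {
        Map<String, long[]> sums = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            long[] acc = sums.computeIfAbsent(f[0], k -> new long[2]);
            acc[0] += Long.parseLong(f[1]);          // unconditional sum
            if ("on".equals(f[2])) {
                acc[1] += Long.parseLong(f[3]);      // counted only for "on" rows
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // Sample rows from the question.
        String[] lines = {"a1,1,on,400", "a1,2,off,100", "a1,3,on,200"};
        aggregate(lines).forEach((id, s) -> System.out.println(id + "," + s[0] + "," + s[1]));
        // prints a1,6,600
    }
}
```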

Can I pass parameters to UDFs in Pig script?

I am relatively new to PigScript. I would like to know if there is a way of passing parameters to Java UDFs in Pig?
Here is the scenario:
I have a log file which has different columns (each representing a primary key in another table). My task is to get the count of distinct primary key values in the selected column.
I have written a Pig script which does the job of getting the distinct primary keys and counting them.
However, I am now supposed to write a new UDF for each column. Is there a better way to do this? For example, if I could pass a column number as a parameter to the UDF, I would not need to write multiple UDFs.
The way to do it is by using DEFINE and the constructor of the UDF. Here is an example of a custom "splitter":
REGISTER com.sample.MyUDFs.jar;
DEFINE CommaSplitter com.sample.MySplitter(',');
B = FOREACH A GENERATE f1, CommaSplitter(f2);
Hopefully that conveys the idea.
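To pass parameters, you do the following in your Pig script:
UDF(document, '$param1', '$param2', '$param3')
Edit: I'm not sure whether those params need to be wrapped in ' ' or not.
In your UDF you then do, for example:
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.UDFContext;

public class UDF extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // Field 0 is the document; fields 1 to 3 are the parameters (file paths here).
        if (input == null || input.size() < 4)
            return false;
        FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
        String var1 = input.get(1).toString();
        InputStream var1In = fs.open(new Path(var1));
        String var2 = input.get(2).toString();
        InputStream var2In = fs.open(new Path(var2));
        String var3 = input.get(3).toString();
        InputStream var3In = fs.open(new Path(var3));
        // doyourthing stands for your own logic; close the streams when it is done.
        return doyourthing(input.get(0).toString());
    }
}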
Yes, you can pass any parameter as an extra field in the input Tuple of your UDF:
exec(Tuple input)
and access it using
input.get(index)