Flatten distinct in Apache Pig

I have a data set that looks like this:
DUMP A;
(10000,({(10000),(20000),(50000)},{(10000),(20000),(30000)}))
(20000,({(10000),(20000),(50000)},{(20000)},{(10000),(20000),(30000)}))
(30000,({(30000)},{(10000),(20000),(30000)}))
(40000,({(40000)},{(40000),(50000)}))
(50000,({(40000),(50000)},{(10000),(20000),(50000)}))
DESCRIBE A;
{foo: bytearray, bar_gp: (baz: {(foo: bytearray)})}
I eventually want it to look like this:
DUMP A;
(10000,{(10000),(20000),(50000),(30000)})
(20000,{(10000),(20000),(50000),(30000)})
(30000,{(10000),(20000),(30000)})
(40000,{(40000),(50000)})
(50000,{(40000),(50000),(10000),(20000)})
I tried using:
B = FOREACH A GENERATE $0, FLATTEN($1);
C = FOREACH B {D = FOREACH B GENERATE FLATTEN($1); D= DISTINCT D; GENERATE $0, D; }
but I kept getting the error:
expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)
How can I get the desired output? I know I could use a UDF to parse it, but I would like to find a built-in solution.

I think you need to apply DISTINCT to the bag before flattening it.
B = FOREACH A {
    D = DISTINCT $1;
    GENERATE $0, FLATTEN(D);
}

Related

Convert the value of a column to uppercase in pig

I need to convert the value of a column to uppercase in pig.
I was able to do this using UPPER, but that creates a new column.
For example:
A = Load 'MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
Dump A;
Returns
a,b,c
d,e,f
Now I need to convert the second column to upper case.
B = Foreach A generate *,UPPER(column2);
Dump B;
returns
a,b,c,B
d,e,f,E
But I need
a,B,c
d,E,f
Please let me know if there is a way to do so.
I haven't tested this, but you can try:
B = Foreach A generate column1,UPPER(column2),column3;
The "*" in the line below is the reason for the extra column:
B = FOREACH A generate *, UPPER(column2);
Instead, list the columns explicitly:
B = Foreach A generate column1, UPPER(column2), column3;
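A small follow-up sketch, not part of the original answers: if later statements need to refer to the converted field by name, AS can keep the name column2 on it. The file name and schema are the ones from the question; reusing the name column2 for the result is an assumption made for illustration.

```pig
-- Load the sample file with the schema from the question.
A = LOAD 'MyFile.txt' USING PigStorage(',')
    AS (column1:chararray, column2:chararray, column3:chararray);

-- List the columns explicitly and name the converted field,
-- so it can still be referenced as column2 downstream.
B = FOREACH A GENERATE column1, UPPER(column2) AS column2, column3;

DESCRIBE B;
DUMP B;
```

DESCRIBE B is only there to check that the second field carries the expected name and type.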
You can also do it with the UPPER user-defined function that ships with Apache Pig in PiggyBank.
First, find the PiggyBank jar:
find / -name "piggybank*.jar*"
Then, in the Pig grunt shell:
grunt> register /usr/local/pig-0.16.0/contrib/piggybank/java/piggybank.jar;
grunt> A = Load 'data/MyFile.txt' using PigStorage(',') as (column1:chararray, column2:chararray, column3:chararray);
grunt> dump A;
result
(a,b,c)
(d,e,f)
Now convert the second column to upper case.
grunt> B = foreach A generate column1,org.apache.pig.piggybank.evaluation.string.UPPER(column2),column3;
grunt> dump B;
result
(a,B,c)
(d,E,f)

Pig error in local mode

I'm trying to list the salaries in descending order, but the output is not correct. I'm running Pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
I need the email and salary (in descending order), so here is what I did:
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I leave out B = foreach A generate $1,$2; and proceed, the output is as expected.
Any suggestions on this?
Cast the bytearray to int and then order.
Try this code:
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)
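Putting the two answers together, a complete script might look like the sketch below. The path and field names come from the question and the answer above; the relation names (employees, email_salary, sorted) are made up for illustration.

```pig
-- Declare names and types at load time so salary is an int, not a bytearray.
employees = LOAD '/local_input_path' USING PigStorage(',')
            AS (letter:chararray, email:chararray, salary:int);

-- Keep only the two fields of interest; they keep their declared types.
email_salary = FOREACH employees GENERATE email, salary;

-- salary is now compared numerically, so 110000 sorts above 9000.
sorted = ORDER email_salary BY salary DESC;

DUMP sorted;
```

With the sample input, this should put f1#xyz.com (110000) first and e#xyz.com (1000) last.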

How to write a function in Pig?

A Pig newbie here. I have a relation A with multiple fields (f1,f2...). I want to quickly see all the distinct values that are there in each field.
Right now, I am doing this:
f1 = FOREACH A GENERATE f1;
f1 = DISTINCT f1;
dump f1;
I don't want to do this for each field. It is too elaborate. Is it possible instead to write some kind of function in Pig to do this? I've looked at UDFs in the documentation, but I don't want to switch to another language like Java or Python. I think Pig is fine for what I am doing.
What you are looking for is a macro, which is Pig's equivalent of a function.
DEFINE MY_MACRO(relation,field) RETURNS selected_field_distinct {
selected_field = FOREACH $relation GENERATE $field;
$selected_field_distinct = DISTINCT selected_field;
};
A = LOAD 'input.txt' USING PigStorage(',') AS (f1:chararray, f2:chararray);
F1 = MY_MACRO(A,'f1');
F2 = MY_MACRO(A,'f2');
DUMP F1;
DUMP F2;
Please note that:
You have to declare the macro above the location where you use it.
You can also write your macro in a different file and import it into your script.
There is no way to use the DUMP command within the macro.
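As a rough sketch of the second note: assuming the DEFINE MY_MACRO block above is saved in a file called my_macros.pig (a hypothetical name) next to the script, the script itself would only need an IMPORT:

```pig
-- my_macros.pig is a hypothetical file that contains the DEFINE MY_MACRO block.
IMPORT 'my_macros.pig';

A = LOAD 'input.txt' USING PigStorage(',') AS (f1:chararray, f2:chararray);

-- The macro is called exactly as if it had been defined in this file.
F1 = MY_MACRO(A, 'f1');
DUMP F1;
```

The DUMP stays in the calling script, in line with the last note.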
A thought worth contemplating ....
If the values seen in f1 will not occur in f2 then you can try this approach. In this case we are performing DISTINCT only once.
f1 = FOREACH A GENERATE f1;
f2 = FOREACH A GENERATE f2;
...
f10 = FOREACH A GENERATE f10;
all_values = UNION f1,f2,..., f10;
uniq_values = DISTINCT all_values;
DUMP uniq_values;
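If the same value can show up in more than one field, one possible variation is to tag each value with the field it came from before the UNION, so that DISTINCT keeps them apart. This is only a sketch for two fields, assuming A has chararray fields f1 and f2; the names v1, v2, field and value are made up for illustration.

```pig
A = LOAD 'input.txt' USING PigStorage(',') AS (f1:chararray, f2:chararray);

-- Pair every value with a constant naming its source field.
v1 = FOREACH A GENERATE 'f1' AS field, f1 AS value;
v2 = FOREACH A GENERATE 'f2' AS field, f2 AS value;

-- One UNION and one DISTINCT cover both fields.
all_values = UNION v1, v2;
uniq_values = DISTINCT all_values;

DUMP uniq_values;
```

Each output tuple then reads (field name, distinct value), which can be filtered by field afterwards.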

Convert one line into multiple line in Pig

I would like to write a Pig script for the case below.
Input is:
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3,BBB,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3
Output should be:
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3
ABC,DEF,GHI,JKL,AAA,bbb,1,2,3
ABC,DEF,GHI,JKL,AAA,ccc,1,2,3
ABC,DEF,GHI,JKL,BBB,aaa,1,2,3
ABC,DEF,GHI,JKL,BBB,bbb,1,2,3
ABC,DEF,GHI,JKL,BBB,ccc,1,2,3
Could anyone please help me?
You can write your own custom UDF or try the approach below.
input.txt
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3,BBB,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3,CCC,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3
PigScript:
A = LOAD 'input.txt' USING PigStorage(',');
-- First group: $4 is the group label, $5..$16 hold three 4-field records.
B = FOREACH A GENERATE $0,$1,$2,$3,
    FLATTEN(TOTUPLE($4)),
    FLATTEN(TOBAG(
        TOTUPLE($5..$8),
        TOTUPLE($9..$12),
        TOTUPLE($13..$16)
    ));
-- Second group: $17 is the group label, $18..$29 hold three 4-field records.
C = FOREACH A GENERATE $0,$1,$2,$3,
    FLATTEN(TOTUPLE($17)),
    FLATTEN(TOBAG(
        TOTUPLE($18..$21),
        TOTUPLE($22..$25),
        TOTUPLE($26..$29)
    ));
D = UNION B,C;
DUMP D;
Output:
(ABC,DEF,GHI,JKL,AAA,aaa,1,2,3)
(ABC,DEF,GHI,JKL,AAA,bbb,1,2,3)
(ABC,DEF,GHI,JKL,AAA,ccc,1,2,3)
(ABC,DEF,GHI,JKL,BBB,aaa,1,2,3)
(ABC,DEF,GHI,JKL,BBB,bbb,1,2,3)
(ABC,DEF,GHI,JKL,BBB,ccc,1,2,3)

How to address fields in Pig Latin after loading

I have a large file with a lot of columns, which I am loading like this:
A = LOAD '/path/to/file' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS name, $1 as address, $2.. ;
C = FOREACH B FILTER BY (name is NOT NULL);
I get an error that projected field [name] does not exist. I don't want to address columns as $0, $1, and so on. How can I give them identifiers?
That Pig script doesn't run for me, but changing it to this:
A = LOAD '/path/to/file' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS name, $1 as address, $2 as another;
C = FILTER B BY (name is NOT NULL);
does work.
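A nested FOREACH is another option, but a nested FILTER works on a bag inside each record, so B has to be grouped first:
G = GROUP B ALL;
C = FOREACH G {
    filtered_rec = FILTER B BY (name is not null);
    GENERATE FLATTEN(filtered_rec);
}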