I have a seemingly simple Pig generate and then filter issue - apache-pig

I am trying to run a simple Pig script on a simple CSV file and I cannot get FILTER to do what I want. I have a test.csv file that looks like this:
john,12,44,,0
bob,14,56,5,7
dave,13,40,5,5
jill,8,,,6
Here is my script that does not work:
people = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filtered = FILTER data BY first == 13;
DUMP filtered;
When I dump data, everything looks good. I get the name and the first and last integer as expected. When I describe the data, everything looks good:
data: {name: bytearray,first: int,second: int}
When I try to filter the data by the first value being 13, I get nothing: DUMP filtered simply returns nothing. Oddly enough, if I change it to first > 13, then all "rows" print out.
However, this script works:
peopletwo = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',') AS (f1:chararray,f2:int,f3:int,f4:int,f5:int);
datatwo = FOREACH peopletwo GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filteredtwo = FILTER datatwo BY first == 13;
DUMP filteredtwo;
What is the difference between filteredtwo and filtered (or data and datatwo for that matter)? I want to know why the new relation obtained using GENERATE (i.e. data) won't filter in the first script as one would expect.

Specify the datatype in the LOAD itself. See below:
people = LOAD 'test5.csv' USING PigStorage(',') as (f1:chararray,f2:int,f3:int,f4:int,f5:int);
filtered = FILTER people BY f2 == 13;
DUMP filtered;
Output
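Assuming test5.csv holds the same rows as the question's test.csv, the dump should return only dave's row:
(dave,13,40,5,5)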
Changing the filter to use > gives
filtered = FILTER people BY f2 > 13;
Output
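With that same assumption, f2 > 13 should return only bob's row:
(bob,14,56,5,7)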
EDIT
When converting from bytearray you have to explicitly cast the value of the fields in the FOREACH; declaring a type with AS alone does not convert the underlying bytearray. This works:
people = LOAD 'test5.csv' USING PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray,(int)$1 AS f1,(int)$4 AS f2;
filtered = FILTER data BY f1 == 13;
DUMP filtered;
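With the explicit (int) casts in place and the same sample data, DUMP filtered should again return dave's row:
(dave,13,40,5,5)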

Related

How to filter out the first line from a variable in Pig

I imported a CSV file into a variable like below:
basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
Below is the output of the first 3 lines:
tmp = limit basketball_players 3;
dump tmp;
("playerID","year","stint","tmID","lgID","GP","GS","minutes","points","oRebounds","dRebounds","rebounds","assists","steals","blocks","turnovers","PF","fgAttempted","fgMade","ftAttempted","ftMade","threeAttempted","threeMade","PostGP","PostGS","PostMinutes","PostPoints","PostoRebounds","PostdRebounds","PostRebounds","PostAssists","PostSteals","PostBlocks","PostTurnovers","PostPF","PostfgAttempted","PostfgMade","PostftAttempted","PostftMade","PostthreeAttempted","PostthreeMade","note")
("abramjo01","1946","1","PIT","NBA","47","0","0","527","0","0","0","35","0","0","0","161","834","202","178","123","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0",)
("aubucch01","1946","1","DTF","NBA","30","0","0","65","0","0","0","20","0","0","0","46","91","23","35","19","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0",)
You can see that the first line is the header of the table. I used the command below to filter out the first line, but it doesn't work.
grunt> players_raw = filter basketball_players by $1 > 0;
2017-05-06 11:03:36,389 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 6 time(s).
When I dump players_raw it returns empty. How can I filter the first row out from a variable?
Use RANK to generate a new column that adds row numbers to the dataset. Use that column to filter out the first row.
basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
ranked = rank basketball_players;
basketball_players_without_header = Filter ranked by (rank_basketball_players > 1);
DUMP basketball_players_without_header;
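RANK prepends a rank_basketball_players column to each tuple; with the sample data above, the ranked relation should start roughly like this, so keeping only rank_basketball_players > 1 drops the header:
(1,"playerID","year","stint","tmID",...)
(2,"abramjo01","1946","1","PIT",...)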
Another way to do this is to filter out the row whose first field matches the header text:
basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
basketball_players_without_header = Filter basketball_players by NOT ($0 matches '.*playerID.*');
DUMP basketball_players_without_header;

Apache Pig reading name value pairs in data file

I have a sample Pig script that reads a CSV file and dumps it to the screen; however, my data has name-value pairs. How can I read in a line of name-value pairs and split the pairs, using the name for the field and the value for the value?
data:
1,Smith,Bob,Business Development
2,Doe,John,Developer
3,Jane,Sally,Tester
script:
data = LOAD 'example-data.txt' USING PigStorage(',')
AS (id:chararray, last_name:chararray,
first_name:chararray, role:chararray);
DESCRIBE data;
DUMP data;
output:
data: {id: chararray,last_name: chararray,first_name: chararray,role: chararray}
(1,Smith,Bob,Business Development)
(2,Doe,John,Developer)
(3,Jane,Sally,Tester)
However, given the following input (as name-value pairs), how could I process the data to get the same "data object"?
id=1,last_name=Smith,first_name=Bob,role=Business Development
id=2,last_name=Doe,first_name=John,role=Developer
id=3,last_name=Jane,first_name=Sally,role=Tester
Refer to STRSPLIT
A = LOAD 'example-data.txt' USING PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray, f4:chararray);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(f1,'=',2)) as (n1:chararray,v1:chararray),
FLATTEN(STRSPLIT(f2,'=',2)) as (n2:chararray,v2:chararray),
FLATTEN(STRSPLIT(f3,'=',2)) as (n3:chararray,v3:chararray),
FLATTEN(STRSPLIT(f4,'=',2)) as (n4:chararray,v4:chararray);
C = FOREACH B GENERATE v1,v2,v3,v4;
DUMP C;
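Given the three name-value lines above, DUMP C should reproduce the same tuples as the plain CSV version:
(1,Smith,Bob,Business Development)
(2,Doe,John,Developer)
(3,Jane,Sally,Tester)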

How to get the number of words per line in Pig?

I'm trying to figure out how many words there are per line in a file in Pig. I've gotten as far as loading and splitting:
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*);
which gets me a bag of tuples, each containing a word. Then when I go to count these items:
counts = FOREACH words GENERATE COUNT(*);
I get an error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException
Is that because some of the lines have an empty bag, or is there something else I'm doing wrong?
If it is a problem with an empty bag, then you can try something like this (not tested):
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*) as tokenized_words;
counts = FOREACH words GENERATE ((tokenized_words IS NULL OR IsEmpty(tokenized_words)) ? 0L : COUNT(tokenized_words)) AS total_count;
Here we write an if-else (bincond) to check whether tokenized_words is null or empty; if it is, we emit zero, otherwise the count of tokens.
Can you try it like this?
Input:
Hi hello how are you
this is apache pig
works
like a charm
Pigscript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;
Output:
(5)
(4)
(1)
()
(3)

Pig Aggregate functions

My input file is below:
a,t1,1000,100
a,t1,2000,200
b,t2,1000,200
b,t2,5000,100
How do I find the count of distinct $0 values in the above file?
myinput = LOAD 'file' AS(a1:chararray,a2:chararray,amt:int,rate:int);
After the above script, what needs to be done?
Also, can I use that distinct count for dividing some other value in a different relation?
First of all, the way you read the data is incorrect. If you try to dump myinput, you'll see that the whole row is read into the first field (a1), while the others are empty.
The reason is that you don't specify a load function, and the default is the PigStorage() built-in function, which expects a tab-delimited file (so it ignores your commas!). You need to explicitly specify a load function (e.g. PigStorage()) via the USING clause and pass it the delimiter:
myInput = LOAD 'file' using PigStorage(',');
myInput2 = FOREACH myInput GENERATE $0 as (a1:chararray), $1 as (a2:chararray), $2 as (amt:int), $3 as (rate:int);
Moving on, to find the distinct $0 values you first have to extract field $0 into a separate relation. The reason is that the DISTINCT statement works on entire records, rather than on separate fields.
myField = FOREACH myInput2 GENERATE a1;
distinctA1 = DISTINCT myField;
Now the result of distinctA1 is {(a), (b)}. By using GROUP ... ALL, you group all of your records together, and then what is left is to COUNT them:
grouped = GROUP distinctA1 all;
countA1 = FOREACH grouped GENERATE COUNT(distinctA1);
And now you're happy. :)
The complete code:
myInput = LOAD 'file' using PigStorage(',');
myInput2 = FOREACH myInput GENERATE $0 as (a1:chararray), $1 as (a2:chararray), $2 as (amt:int), $3 as (rate:int);
myField = FOREACH myInput2 GENERATE a1;
distinctA1 = DISTINCT myField;
grouped = GROUP distinctA1 all;
countA1 = FOREACH grouped GENERATE COUNT(distinctA1);
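With the four sample rows from the question, dumping countA1 should yield a single tuple holding the number of distinct $0 values:
(2)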
You can do something like this:
myInput = LOAD 'file.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray,amt:int,rate:int);
Data = GROUP myInput BY $0;
Data = FOREACH Data GENERATE $0;
Data = GROUP Data ALL;
Data = FOREACH Data GENERATE $0,COUNT($1);
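With the same four rows, this version should produce the group key alongside the count:
(all,2)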
NB: By grouping on $0 you are doing the same thing as a DISTINCT, and you get better performance ;)

Getting a count of a particular string

How can I count the number of occurrences of a particular string, say 'Y', for each individual row and do calculations on that count afterwards? For example, how can I find the number of 'Y' values for each FMID and do calculations on that count for each FMID? (Dataset screenshot omitted.)
You could use the TOKENIZE built-in function, which converts a row into a bag, and then use nested filtering in the FOREACH to get a bag that contains only the word you are interested in, on which you can use COUNT. See the FOREACH description.
For example:
inpt = load '....' as (line:chararray);
row_bags = foreach inpt generate line, TOKENIZE(line) as word;
cnt = foreach row_bags {
match_1 = filter word by $0 == 'Y';
match_2 = filter word by $0 == 'X';
generate line, COUNT(match_1) as count_1, COUNT(match_2) as count_2;
};
dump cnt;
Using some functions from the DataFu library you could get a count for each string in the bag word.
Here's the simplest way I can think of to solve your problem:
define Transpose datafu.pig.util.TransposeTupleToBag();
data = LOAD 'input' USING PigStorage(',') AS
(fmid:int, field1:chararray, field2:chararray,
field3:chararray, field4:chararray);
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
};
I'm not really certain about the data schema you're working with. I'm going to assume you have a relation in Pig consisting of a sequence of tuples. I'm also going to assume you have a lot of fields, making it a pain to reference each individually.
I'll walk through this example piece by piece to explain. Without loss of generality, I will use this data below for my example:
data = LOAD 'input' USING PigStorage(',') AS (fmid:int, field1:chararray, field2:chararray, field3:chararray, field4:chararray);
Now that we've loaded the data, we want to transpose each tuple to a bag, because once it is in a bag we can perform counts on the items within it more easily. We'll use TransposeTupleToBag from the DataFu library:
define Transpose datafu.pig.util.TransposeTupleToBag();
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
Note that you need at least Pig 0.11 to use this UDF. Note the use of $1.., which is known as a project-range expression in Pig. If you have many fields this is really convenient.
If you were to dump data2 at this point you would get this:
(1000,{(field1,N),(field2,N),(field3,N),(field4,N)})
(2000,{(field1,N),(field2,Y),(field3,N),(field4,N)})
(3000,{(field1,Y),(field2,Y),(field3,N),(field4,N)})
(4000,{(field1,Y),(field2,Y),(field3,Y),(field4,Y)})
What we've done is taken the fields from the tuple after the 0th element (fmid), and transposed these into a bag where each tuple has a key and value field.
Now that we have a bag we can do a simple filter and count:
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
};
Now if you dump data2 you get the expected counts reflecting the number of Y values in each tuple:
(1000,0)
(2000,1)
(3000,2)
(4000,4)
Here is the full source code for my example as a unit test, which you can put directly in the DataFu unit tests to try out:
/**
register $JAR_PATH
define Transpose datafu.pig.util.TransposeTupleToBag();
data = LOAD 'input' USING PigStorage(',') AS (fmid:int, field1:chararray, field2:chararray, field3:chararray, field4:chararray);
data2 = FOREACH data GENERATE fmid, Transpose($1..) as fields;
dump data2;
data2 = FOREACH data2 {
y_fields = FILTER fields BY value == 'Y';
GENERATE fmid, SIZE(y_fields) as y_cnt;
};
dump data2;
STORE data2 INTO 'output';
*/
@Multiline
private String example;
@Test
public void example() throws Exception
{
PigTest test = createPigTestFromString(example);
writeLinesToFile("input",
"1000,N,N,N,N",
"2000,N,Y,N,N",
"3000,Y,Y,N,N",
"4000,Y,Y,Y,Y");
test.runScript();
super.getLinesForAlias(test, "data2");
}