pig: getting count of strings from file A from file B - apache-pig

I am new to Pig and looking for help.
I have two files, A (templates) and B (data), both with huge unstructured content.
The goal is to traverse file B (data) and find the count of matches for each template (line) of file A.
I think it should work as a loop with a nested statement, but I do not know how to achieve that in Pig.
Example:
file1.txt
hello ravi
hi mohit
bye sameer
hi mohit
hi abc
hello cds
hi assaad
file2.txt
hi mohit
hi assaad
I need the count of occurrences of both lines of file2 in file1.
The expected output would look like:
hi mohit: 2
hi assaad: 1
Please do let me know.

Lets start by loading both your datasets:
data = LOAD 'file1.txt' AS (line:chararray);
templates = LOAD 'file2.txt' AS (template:chararray);
Now we essentially need to JOIN the above relations on the templates. Once joined, we can GROUP on the template to get counts for each template. However, that would require two map-reduce stages: one for the JOIN and one for the GROUP BY. This is where you can use COGROUP. It is an extremely useful operation, and you can read more about it here: https://www.tutorialspoint.com/apache_pig/apache_pig_cogroup_operator.htm
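For contrast, the two-stage version would look roughly like this (a sketch built on the two relations loaded above):
joined = JOIN data BY line, templates BY template;
grouped = GROUP joined BY templates::template;
counts = FOREACH grouped GENERATE group AS template, COUNT(joined) AS templateCount;
With COGROUP the same result takes a single statement: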
cogroupedData = COGROUP data BY line, templates BY template;
templateLines = FILTER cogroupedData BY (NOT IsEmpty(templates));
templateCounts = FOREACH templateLines GENERATE
group AS template,
COUNT(data.line) AS templateCount;
DUMP templateCounts;
What COGROUP does is essentially a JOIN followed by a GROUP BY on the same key (template in this case), and it takes only one map-reduce stage. The FILTER above removes records whose line did not match any template in file2.txt.
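For intuition, with the sample files above the intermediate and final relations would look roughly like this (the first bag holds matches from data, the second from templates):
cogroupedData: (hi mohit,{(hi mohit),(hi mohit)},{(hi mohit)})
               (hello ravi,{(hello ravi)},{})
               ...
templateCounts: (hi mohit,2)
                (hi assaad,1)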

Related

Pig Query - Giving inconsistent results in AWS EMR

I am new to Pig. I have written a query which is not working as expected. I am trying to process the Google ngrams dataset provided to me.
I load the data, which is 1 GB:
bigrams = LOAD '$INPUT' AS (bigram:chararray, year:int, occurrences:int, books:int);
Then I select a subset limited to 2000 entries:
limbigrams = LIMIT bigrams 2000;
Then I dump the limited data (sample output):
(GB product,2006,1,1)
(GB product,2007,5,5)
(GB wall_NOUN,2007,27,7)
(GB wall_NOUN,2008,35,6)
(GB2 ,_.,1906,1,1)
(GB2 ,_.,1938,1,1)
Now I do a GROUP BY on limbigrams:
D = GROUP limbigrams BY bigram;
When I dump D, I see an entirely different data set (sample):
(GLABRIO .,{(GLABRIO .,1977,3,3),(GLABRIO .,1992,3,3),(GLABRIO .,1997,1,1),(GLABRIO .,2000,6,6),(GLABRIO .,2001,9,1),(GLABRIO .,2002,24,3),(GLABRIO .,2003,3,1)})
(GLASS FILMS,{(GLASS FILMS,1978,1,1),(GLASS FILMS,1976,2,1),(GLASS FILMS,1970,3,3),(GLASS FILMS,1966,7,1),(GLASS FILMS,1962,1,1),(GLASS FILMS,1958,1,1),(GLASS FILMS,1955,1,1),(GLASS FILMS,1899,2,2),(GLASS FILMS,1986,6,3),(GLASS FILMS,1984,1,1),(GLASS FILMS,1980,7,3)})
I am not attaching the entire output, but there is not even a single row of overlap between the two outputs (i.e., before and after the GROUP BY), so comparing the full files would not add much.
Why does this happen?
The dumps are accurate. The GROUP BY operator in Pig creates a single record for each group and puts every record belonging to that group inside a bag. You can see this in the last record of your second dump: it stands for the group GLASS FILMS and holds a bag of records whose bigram is GLASS FILMS. Also note that LIMIT gives no guarantee about which 2000 records are returned unless it is preceded by an ORDER BY, so different runs can sample different rows; that is why the two dumps need not overlap. You can read more about the GROUP BY operator here: https://www.tutorialspoint.com/apache_pig/apache_pig_group_operator.htm
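If you ever need the flat rows back out of the grouped relation, a minimal sketch:
flattened = FOREACH D GENERATE FLATTEN(limbigrams);
-- flattened contains exactly the (bigram, year, occurrences, books) tuples that went into the GROUP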

Pig calculating avg of delay fails

I have a file of flight data containing the airplane destination and delay (the delay can be a negative or positive number):
A = load 'flightdelays' using Pigstorage(',');
B = foreach a generate $14 as delay:int, $17 as dest:chararray;
C = group b all; -- this fails with a cast error; I also get an error that it failed to read data from the input file
D = foreach c generate b.dest, AVG(b.delay);
When I execute this, I get 0 records read from the source file and the map-reduce job fails.
Why is it not able to calculate AVG?
Check the extension/path of the file. Is your file comma separated? Also, there are several case issues in your script.
PigStorage: the s is lowercase in your load statement. It should be:
A = load 'flightdelays' using PigStorage(',');
B = foreach A generate $14 as delay:int, $17 as dest:chararray;
There are no relations called a, b, and c; you are loading data into relation A, and so on.
First, A and a are treated differently (relation names in Pig are case sensitive). Second, there is the matter of calculating an aggregate function over a grouped relation:
in the FOREACH you should project the grouping attribute and the aggregate function.
In this scenario you used GROUP ... ALL, so you cannot project b.dest alongside the aggregate.
If you want the destination-wise AVG() delay, you should group by dest, as in the sketch below.
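Putting this together, a hedged sketch of the corrected script (field positions taken from the question; adjust if your file differs):
A = load 'flightdelays' using PigStorage(',');
B = foreach A generate (int)$14 as delay, (chararray)$17 as dest;
C = group B by dest;  -- per-destination groups instead of ALL
D = foreach C generate group as dest, AVG(B.delay) as avg_delay;
dump D;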

Pig Optimization on Group by

Let's assume I have a large data set with the schema layout below:
id,name,city
---------------
100,Ajay,Chennai
101,John,Bangalore
102,Zach,Chennai
103,Deep,Bangalore
....
...
I have two styles of Pig code giving me the same output.
Style 1:
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_grp = group records by city;
records_each = foreach records_grp generate group as city,COUNT(records.id) as emp_cnt;
dump records_each;
Style 2:
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_each = foreach (group records by city) generate group as city,COUNT(records.id) as emp_cnt;
dump records_each;
In the second style I used a nested FOREACH. Does the Style 2 code run faster than the Style 1 code or not?
I would like to reduce the total time taken to complete the Pig job.
Does the Style 2 code achieve that, or is there no impact on the total time?
If somebody can confirm, I can run similar code on my cluster against a very large dataset.
The two solutions have the same cost.
However, if records_grp is not used elsewhere, version 2 spares you from declaring a variable and makes your script shorter.
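If you want to confirm this on your cluster, EXPLAIN prints the logical, physical, and map-reduce plans without running the job; both styles should yield identical plans:
explain records_each;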

How to split a data in particular column into two other columns using pig scripts?

Hi, I am working in big data. Since I am new to Pig programming, please help me get the required output. I have a CSV file with many columns; one of the columns is price, which has data like the following:
(10 Lacs)
(20 to 30 Lacs)
And I need this to be split as:
price min max
10 null null
null 20 30
I have tried the following code:
a = LOAD '/user/folder1/filename.csv' using PigStorage(',') as (SourceWebsite:chararray,PropertyType:chararray,PropertyId:chararray,title:chararray,bedroom:int,bathroom:int,Balconies:chararray,price:chararray,pricepersqft:chararray,builtuparea:chararray,address:chararray,otherdetails:chararray,description:chararray,posted:chararray,Features:chararray,ContactDetails:chararray);
b = FOREACH a GENERATE STRSPLIT(price, 'to');
c = FOREACH b GENERATE FLATTEN(STRSPLIT(Price,',')) AS (MAX:int,MIN:int);
dump c;
Any help will be appreciated.
I just ran into the same issue, and here is how I managed to solve it.
Suppose the column output_raw.output_line_raw looks like this:
abc|def
gh|j
Then I split it into multiple columns like so:
output_in_columns = FOREACH output_raw GENERATE
FLATTEN(STRSPLIT(output_line_raw,'\\|'));
To test whether it succeeded, I dumped the result after referring to the columns:
output_selection = FOREACH output_in_columns GENERATE
$0,
$1;
DUMP output_selection;
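Applied to the price column from the question, a hedged sketch (it assumes the values look exactly like the two samples and that your Pig version accepts the null constant in a bincond):
b = FOREACH a GENERATE REPLACE(price, '[()]|Lacs', '') AS p;
c = FOREACH b GENERATE
    (INDEXOF(p, 'to', 0) <  0 ? TRIM(p) : null) AS price,
    (INDEXOF(p, 'to', 0) >= 0 ? TRIM(SUBSTRING(p, 0, INDEXOF(p, 'to', 0))) : null) AS min,
    (INDEXOF(p, 'to', 0) >= 0 ? TRIM(SUBSTRING(p, INDEXOF(p, 'to', 0) + 2, (int)SIZE(p))) : null) AS max;
DUMP c;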

Pig : Cogroup How to avoid Blank values

I am new to Pig.
While doing a COGROUP I came across an issue.
I am trying to COGROUP two files, and the keys I am using for the COGROUP have null values.
Below are my input files:
Input_file_1 :
a|b||
e|f||
Input_file_2 :
a|b||
e|f||
I am using all four columns as the key for the COGROUP (the last two columns are blank).
My expected output is two records, but I am getting four records as output.
Can anyone please help with how to avoid blank values while doing a COGROUP in Pig?
Thanks in advance.
Null values are handled very differently in Pig.
As Alan Gates, the author of the book Programming Pig, says:
cogroup handles null values in the keys similarly to group and unlike
join. That is, all records with a null value in the key will be collected together.
Thus the output of the COGROUP would be:
((a,b,,),{(a,b,,)},{})
((a,b,,),{},{(a,b,,)})
((e,f,,),{(e,f,,)},{})
((e,f,,),{},{(e,f,,)})
In your case, you have to go for JOIN instead of COGROUP, which gives you the following result:
(a,b,,,a,b,,)
(e,f,,,e,f,,)
Then generate the required values from it.
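A hedged sketch of that JOIN (assuming both files load with PigStorage('|') into four chararray columns, where the trailing blanks come through as empty strings):
r1 = LOAD 'Input_file_1' USING PigStorage('|') AS (c1:chararray, c2:chararray, c3:chararray, c4:chararray);
r2 = LOAD 'Input_file_2' USING PigStorage('|') AS (c1:chararray, c2:chararray, c3:chararray, c4:chararray);
joined = JOIN r1 BY (c1, c2, c3, c4), r2 BY (c1, c2, c3, c4);
result = FOREACH joined GENERATE r1::c1, r1::c2, r1::c3, r1::c4;  -- keep one copy of the matched columns
DUMP result;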