Schema in Pig after Flattening the String

I am having a problem using the fields (schema) produced by flattening a string in Pig. I have the following code:
Data = load 'data/*.txt' using PigStorage( ) AS(...., date:chararray, .....);
B = foreach Data FLATTEN(REGEX_EXTRACT_ALL(date, '"(.)/(.)/(.*) (.):(.):(.*)"')) AS (month:int, day:int, year:int, hour:int, min:int, second:int);
--B = filter B by year==2015;
--B = filter B by month ==1 OR month ==2;
C = foreach B generate speed, month, day, year, hour, min;
store C into 'data/out_files' using PigStorage(',');
where date is in the form '2/23/2015 23:56:49'.
This works perfectly fine. But when I add a filter on B (year == 2015, or month == 1 OR month == 2), the code does not work. Do you have any idea how to use a field after flattening the string? Thank you for your help.

Can you try this?
input:
2/23/2015 23:56:49
1/23/2014 23:56:49
9/23/2014 23:56:49
8/23/2014 23:56:49
PigScript:
A = LOAD 'input' AS (date:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(date, '([0-9]+)/([0-9]+)/([0-9]+)\\s+([0-9]+):([0-9]+):([0-9]+)')) AS (month,day,year,hour,min,second);
C = FILTER B BY (month==1) OR (month==2) OR (year==2015);
D = FOREACH C GENERATE month,day,year,hour,min,second;
DUMP D;
Output:
(2,23,2015,23,56,49)
(1,23,2014,23,56,49)
The following regex also works:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(date, '(\\d+)/(\\d+)/(\\d+)\\s+(\\d+):(\\d+):(\\d+)')) AS (month,day,year,hour,min,second);
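One likely reason the pattern in the question yields nothing to filter on: each `(.)` captures exactly one character, so a two-digit day such as 23 cannot match, and the pattern also expects literal double quotes around the value. The plain-Python sketch below illustrates this under the assumption that REGEX_EXTRACT_ALL only returns groups when the pattern matches the whole field (modelled here with `re.fullmatch`). The helper `extract_all` and the variable names are hypothetical, and the question's pattern is written with a single space between the date and time parts.

```python
import re

# The question's pattern (assumed to have one space between date and time).
QUESTION_PATTERN = r'"(.)/(.)/(.*) (.):(.):(.*)"'
# The pattern from the answer.
ANSWER_PATTERN = r'(\d+)/(\d+)/(\d+)\s+(\d+):(\d+):(\d+)'


def extract_all(pattern, text):
    """Return the captured groups as ints, or None if the whole string does not match."""
    match = re.fullmatch(pattern, text)
    return tuple(int(g) for g in match.groups()) if match else None


rows = [
    "2/23/2015 23:56:49",
    "1/23/2014 23:56:49",
    "9/23/2014 23:56:49",
    "8/23/2014 23:56:49",
]

# The question's pattern does not match, so there are no fields to filter on.
print(extract_all(QUESTION_PATTERN, rows[0]))

parsed = [extract_all(ANSWER_PATTERN, row) for row in rows]
# Same condition as the FILTER in the answer: month 1, month 2, or year 2015.
kept = [t for t in parsed if t is not None and (t[0] == 1 or t[0] == 2 or t[2] == 2015)]
for t in kept:
    print(t)
```

With the answer's pattern, the two rows that survive the filter are the same two shown in the Output above.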

Related

PIG need to find max

I am new to Pig and working on a problem where I need to find the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course. It told me I had to use an explicit cast. I have the filtering etc. figured out; I just can't figure out how to get the final answer.

The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
    weights::id AS id,
    weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
    group AS id,
    MAX(just_ids_weights.weight) AS wgt;
If you wanted the maximum weight in all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
    MAX(just_ids_weights.weight) AS wgt;
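To make the bag idea concrete, here is a rough plain-Python sketch of what GROUP ALL followed by MAX computes, using the sample rows from the question. The last step (picking the player who carries the maximum) is not part of the answer above; it is an added assumption about what the asker ultimately wants.

```python
# Rows as (id, weight), taken from the sample in the question.
just_ids_weights = [
    ("bayja01", 210),
    ("crawfca02", 225),
    ("damonjo01", 205),
    ("dejesda01", 190),
    ("eckstda01", 170),
]

# GROUP ... ALL: every row ends up in a single bag.
bag = list(just_ids_weights)

# MAX(just_ids_weights.weight): project the weight column, then take the max.
max_weight = max(weight for _, weight in bag)

# Extra step (assumption, not in the answer): keep the ids that carry the maximum.
heaviest = [player_id for player_id, weight in bag if weight == max_weight]

print(max_weight, heaviest)
```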

Take MIN EFF_DT and MAX_CANC_dt from data in PIG

Schema :
TYP|ID|RECORD|SEX|EFF_DT|CANC_DT
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Suppose I have multiple records like this. I want to display only the records that have the minimum EFF_DT and the maximum CANC_DT.
In this case that is just this one record:
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Thank you

Get the min EFF_DT and max CANC_DT and use them to filter the relation. Assuming you have a relation A:
B = GROUP A ALL;
X = FOREACH B GENERATE MIN(A.EFF_DT);
Y = FOREACH B GENERATE MAX(A.CANC_DT);
C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0));
D = DISTINCT C;
DUMP D;
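The same logic, sketched in plain Python on the three sample rows from the question: compute the global minimum and maximum, keep the rows that match both, then drop duplicates. The dates are compared as ISO-formatted strings, which order correctly as text; the index constants are names chosen for this illustration.

```python
rows = [
    ("DMF", "1234567", "98765432", "M", "2011-08-30", "9999-12-31"),
    ("DMF", "1234567", "98765432", "M", "2011-04-30", "9999-12-31"),
    ("DMF", "1234567", "98765432", "M", "2011-04-30", "9999-12-31"),
]
EFF_DT, CANC_DT = 4, 5  # column positions (illustrative names)

# X and Y in the Pig script: one value each, computed over the whole relation.
min_eff = min(r[EFF_DT] for r in rows)
max_canc = max(r[CANC_DT] for r in rows)

# C: keep rows matching both values; D: remove duplicate rows.
matching = [r for r in rows if r[EFF_DT] == min_eff and r[CANC_DT] == max_canc]
distinct = list(dict.fromkeys(matching))

print(distinct)
```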

Let's say you have this data (sample here):
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31
Perform these steps:
-- 1. Read data, if you have not
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
-- 2. Group data by the attribute of your choice; in this case it is TYP
grouped = group A by typ;
-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;
--
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)
If you need to, change datetime to chararray.
Note: there are other ways of doing this. Apart from the load step, the approach shown here produces the desired result in two steps: GROUP and FOREACH.
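For comparison, a small plain-Python sketch of the per-group version, using the six sample rows above and treating the dates as ISO strings rather than datetime values (an assumption made to keep the example short):

```python
from collections import defaultdict

rows = [
    ("DMF", "1234567", "98765432", "M", "2011-08-30", "9999-12-31"),
    ("DMF", "1234567", "98765432", "M", "2011-04-30", "9999-12-31"),
    ("DMF", "1234567", "98765432", "M", "2011-04-30", "9999-12-31"),
    ("DMX", "1234567", "98765432", "M", "2011-12-30", "9999-12-31"),
    ("DMX", "1234567", "98765432", "M", "2011-04-30", "9999-12-31"),
    ("DMX", "1234567", "98765432", "M", "2011-04-01", "9999-12-31"),
]

# GROUP A BY typ: one bag of rows per typ value.
grouped = defaultdict(list)
for row in rows:
    grouped[row[0]].append(row)

# FOREACH grouped GENERATE group, MIN(A.eff_dt), MAX(A.canc_dt)
min_max = {
    typ: (min(r[4] for r in bag), max(r[5] for r in bag))
    for typ, bag in grouped.items()
}

print(min_max)
```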

Count distinct values in a group using pig

My problem, in a general sense, is that I'd like to group my data and then count the unique values of a field.
Specifically, for the data below, I want to group by 'category' and 'year' and then count the unique values of 'food'.
category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple
This is as far as I can get, which is just picking out the values and using some of the neat pig date functions:
a = load '$input' using PigStorage(',') as (category:chararray,id:chararray,mydate:chararray,mystore:chararray,food:chararray);
b = foreach a generate category, id, ToDate(mydate,'yyyy-MM-dd HH:mm:ss') as myDt:DateTime, mystore,food;
c = foreach b generate category, GetYear(myDt) as year:int, mystore,food;
dump c;
The output from the alias 'c' is:
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2015,store1,milk)
(catB,2014,store2,milk)
(catB,2014,store2,apple)
I want in the end:
catA, 2014, {(apple, 2), (milk, 2)}
catA, 2015, {(milk, 1)}
catB, 2014, {(apple, 1), (milk, 1)}
I've seen some examples of generating value counts, but grouping by category and year is tripping me up.

Input:
category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple
Yes, you can use a nested FOREACH after your grouping. Inside that nested FOREACH you can apply DISTINCT to the foods and then count them.
The code below should help:
Pig Script:
list = LOAD 'user/cloudera/apple.txt' USING PigStorage(',') AS(category:chararray,id:chararray,mydate:chararray,my_store:chararray,food:chararray);
list_each = FOREACH list GENERATE category,SUBSTRING(mydate,0,4) as my_year, my_store, food;
list_grp = GROUP list_each BY (category,my_year);
list_nested_each = FOREACH list_grp
{
    list_inner_each = FOREACH list_each GENERATE food;
    list_inner_dist = DISTINCT list_inner_each;
    GENERATE flatten(group) as (category,my_year), COUNT(list_inner_dist) as no_of_uniq_foods;
};
dump list_nested_each;
Output:
(catA,2014,2)
(catA,2015,1)
(catB,2014,2)
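As a rough cross-check of what the nested FOREACH computes, here is the same distinct count sketched in plain Python. The rows mirror `list_each` (category, year taken from the first four characters of the date, store, food); a set per group plays the role of DISTINCT.

```python
from collections import defaultdict

list_each = [
    ("catA", "2014", "store1", "apple"),
    ("catA", "2014", "store1", "milk"),
    ("catA", "2014", "store1", "apple"),
    ("catA", "2014", "store1", "milk"),
    ("catA", "2015", "store1", "milk"),
    ("catB", "2014", "store2", "milk"),
    ("catB", "2014", "store2", "apple"),
]

# GROUP BY (category, my_year), keeping only the distinct foods in each group.
foods_by_group = defaultdict(set)
for category, year, _store, food in list_each:
    foods_by_group[(category, year)].add(food)

# COUNT of the distinct foods per group.
result = sorted((cat, year, len(foods)) for (cat, year), foods in foods_by_group.items())
for row in result:
    print(row)
```

Note that this gives the number of distinct foods per group, not a count for each individual food.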

Appending to the code in the question:
d = group c by (category, year, food);
e = foreach d generate FLATTEN(group), COUNT(c) as count;
will produce:
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
The key is to group by 'food' as well. Interesting. Any other insight is welcome.
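The grouping by (category, year, food) can be sketched in plain Python with a `Counter`. The second step, which folds the counts back into one entry per (category, year), is an extra illustration of the shape asked for in the question and is not part of the Pig snippet above.

```python
from collections import Counter, defaultdict

# Rows as dumped from alias c in the question.
c = [
    ("catA", 2014, "store1", "apple"),
    ("catA", 2014, "store1", "milk"),
    ("catA", 2014, "store1", "apple"),
    ("catA", 2014, "store1", "milk"),
    ("catA", 2015, "store1", "milk"),
    ("catB", 2014, "store2", "milk"),
    ("catB", 2014, "store2", "apple"),
]

# group c by (category, year, food), then COUNT(c) per group.
counts = Counter((category, year, food) for category, year, _store, food in c)

# Extra step (illustrative): nest the per-food counts under (category, year).
nested = defaultdict(dict)
for (category, year, food), n in counts.items():
    nested[(category, year)][food] = n

for key, foods in nested.items():
    print(key, foods)
```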

Pig Grouping Functions

I would like to find the item each person bought most recently. Assume that the same person can buy many items.
Below is the input:
kumar,2014-09-30,television
kumar,2014-07-27,smartphone
Andrew,2014-06-21,camera
Andrew,2014-05-20,car
I need the output as below
kumar,2014-09-30,television
Andrew,2014-06-21,camera
I wrote the Pig script up to this point, but I don't know how to proceed from here. Can somebody help me?
A = LOAD 'records.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B GENERATE group,MAX(A.date);
But I need to get the item that each person purchased most recently. How do I get that? If I apply GROUP, I am supposed to use only aggregate functions in Pig.
How do I get the respective item that was purchased?

Use the bag with ORDER BY in a nested FOREACH. It uses only one MapReduce job and is more in the Apache Pig style.
A = LOAD 'input.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B {
    ordered = ORDER A BY date DESC; -- this will cause a secondary sort to optimise the execution
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(latest); -- an advantage of Pig: all columns are preserved, not dropped as in a SQL GROUP BY
};
DUMP C;
Also, using $0, $1, etc. is convenient, but imagine a script with hundreds of lines and dozens of GROUP BY and JOIN operations that project using '$': it is a nightmare to follow the flow of columns through such a script, and a huge amount of time is wasted maintaining and changing it.
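The nested ORDER / LIMIT 1 pattern amounts to "sort each group's bag by date, newest first, and keep the first row". A minimal plain-Python sketch on the sample input, relying on the fact that yyyy-MM-dd strings sort chronologically:

```python
from collections import defaultdict

records = [
    ("kumar", "2014-09-30", "television"),
    ("kumar", "2014-07-27", "smartphone"),
    ("Andrew", "2014-06-21", "camera"),
    ("Andrew", "2014-05-20", "car"),
]

# GROUP A BY name: one bag of full rows per person.
by_name = defaultdict(list)
for rec in records:
    by_name[rec[0]].append(rec)

latest = []
for name, bag in by_name.items():
    ordered = sorted(bag, key=lambda r: r[1], reverse=True)  # ORDER A BY date DESC
    latest.extend(ordered[:1])                               # LIMIT ordered 1, then FLATTEN

for row in latest:
    print(row)
```

Because whole rows are kept in the bag, the item column comes along without a second join.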

I hope this works for you.
input.txt
kumar,2014-09-30,television
kumar,2014-07-27,smartphone
Andrew,2014-06-21,camera
Andrew,2014-05-20,car
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B GENERATE group,FLATTEN(MAX($1.date));
D = JOIN A BY (name, date), C BY ($0, $1); -- join on name as well, so equal dates of different people do not cross-match
E = FOREACH D GENERATE $0,$1,$2;
DUMP E;
Output:
(Andrew,2014-06-21,camera)
(kumar,2014-09-30,television)

Pig split and join

I have a requirement to propagate field values from one row to the other rows with the same id, depending on the record type.
For example, my raw input is
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.

Pig is not exactly SQL; it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using a GROUP BY, a FILTER nested in the FOREACH, and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
    P = FILTER inpt BY type == 'p'; -- keep the records of type 'p' for this id
    PL = LIMIT P 1; -- make sure there is just one
    GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert the bags produced by GROUP back to rows
};
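To illustrate what the grouped FOREACH does, here is a plain-Python sketch on the input from the question. Empty browser fields are represented as None (an assumption about how they load), and the output columns are arranged as id, browser, type to match the desired result in the question rather than the column order the Pig script generates.

```python
from collections import defaultdict

inpt = [
    (1, "firefox", "p"), (1, None, "q"), (1, None, "r"), (1, None, "s"),
    (2, "ie", "p"), (2, None, "s"),
    (3, "chrome", "p"), (3, None, "r"), (3, None, "s"),
    (4, "netscape", "p"),
]

# GROUP inpt BY id
grp = defaultdict(list)
for row in inpt:
    grp[row[0]].append(row)

result = []
for id_, bag in grp.items():
    pl = [r for r in bag if r[2] == "p"][:1]  # FILTER by type 'p', then LIMIT 1
    # Flattening two bags pairs every row of the group with the single 'p' row;
    # a group without a 'p' row therefore produces no output rows.
    for p_row in pl:
        for r in bag:
            result.append((r[0], p_row[1], r[2]))

for row in result:
    print(row)
```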
