Filter out city, year and temp using pig? - apache-pig

Record :
Pune,2007,31.5
Pune,2007,30.5
Pune,2008,34.5
Blre,2009,13.0
Blre,2009,10.5
Script which I'm using :
grunt> A = LOAD '/home/cloudera/temp' using PigStorage(',') AS (city:chararray,year:int,temp:double);
grunt> B = group A by city;
grunt> C = FOREACH B GENERATE group, MAX(A.temp);
Output:
Pune, 34.5
Blre, 13.0
Expected Output:
Pune, 2007, 31.5
Pune, 2008, 34.5
Blre, 2009, 13.0
How can I achieve this result? Thanks in advance.

Group by both city and year, so that MAX is computed per (city, year) pair:
A = LOAD '/home/cloudera/temp' using PigStorage(',') AS (city:chararray,year:int,temp:double);
B = group A by (city,year);
C = FOREACH B GENERATE FLATTEN(group) AS (city,year), MAX(A.temp);
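To sanity-check the expected output outside Pig, the same grouping can be sketched in plain Python. This is only an illustration of the logic, not a replacement for the script above. The function name max_temp_by_city_year is made up for this example, and the rows are the sample records from the question.

```python
# Sample records from the question: (city, year, temp)
records = [
    ("Pune", 2007, 31.5),
    ("Pune", 2007, 30.5),
    ("Pune", 2008, 34.5),
    ("Blre", 2009, 13.0),
    ("Blre", 2009, 10.5),
]


def max_temp_by_city_year(rows):
    """Return {(city, year): max temp}, mirroring GROUP BY (city, year) plus MAX."""
    best = {}
    for city, year, temp in rows:
        key = (city, year)
        # Keep the highest temperature seen so far for this (city, year) pair.
        if key not in best or temp > best[key]:
            best[key] = temp
    return best


result = max_temp_by_city_year(records)
for (city, year), temp in result.items():
    print(city, year, temp)
```

Grouping on city alone collapses the 2007 and 2008 rows for Pune into one group, which is why the original script returned only one row per city.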

Related

Mimic the Excel LookUp function in SQL

I want to mimic what the Excel LOOKUP function does, but in SQL:
LOOKUP(recAge+1, 'reference'!refAge, 'reference'!refVal)
I have two tables: Records, with recAge and Gender columns, and the "reference" table, which has three columns: refAge, refVal and Gender.
What I want to do is get the refVal from the "reference" table where:
refAge = recAge + 1; if no such refAge exists, the closest smaller refAge should be used instead.
Both rows have the same gender.
Here is an example of what I have and what should be the final result:
recAge  Gender
1       F
1.5     M
2       F
2.5     M
The "reference" table:
refAge  refVal  Gender
1       13      F
1.5     17      F
2       12      F
2.5     11      F
1       10      M
1.5     15      M
2       14      M
2.5     19      M
I should be getting this as the result:
Gender  recAge  refVal
F       1       12
M       1.5     19
F       2       11   >>> since 2+1 = 3 and this does not exist in the Reference table
M       2.5     19   >>> same as the previous
I am stuck on how to join the two tables, since there is no common key to join on. I tried the following query, but it only displays rows where the ages are equal in both tables.
WITH Ltable AS
(
SELECT
Gender, Age, refVal
FROM
Records
FULL JOIN
Reference ON Records.Age = Reference.Age
WHERE
(Records.Age+1 = Reference.Age)
OR (Records.Age + 1 > Reference.Age)
)
but it only shows me the (Records.Age+1 = Reference.Age) values, and the rest of the ages are not matched with their closest smaller reference age.
I also tried the join on the gender but the same is happening.
You can use a correlated subquery for this and fetch the TOP 1 result for each row, based on your condition:
SELECT
    rec.Gender
  , rec.recAge
  , (
        SELECT TOP 1 refVal
        FROM Reference ref
        WHERE rec.Gender = ref.Gender
          -- alternative (nearest refAge in either direction): ORDER BY ABS(rec.recAge + 1 - ref.refAge)
          AND ref.refAge <= rec.recAge + 1
        ORDER BY ref.refAge DESC
    ) AS refVal
FROM Records rec
I am not sure I got your logic correctly, you might need to tweak, but you should get the idea.
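The rule the subquery applies (largest refAge that is less than or equal to recAge + 1, within the same gender) can be checked against the sample data with a short Python sketch. The helper names build_index and lookup are made up for this illustration; the rows are copied from the tables in the question.

```python
from bisect import bisect_right

# (refAge, refVal, Gender) rows from the "reference" table in the question
reference = [
    (1, 13, "F"), (1.5, 17, "F"), (2, 12, "F"), (2.5, 11, "F"),
    (1, 10, "M"), (1.5, 15, "M"), (2, 14, "M"), (2.5, 19, "M"),
]
# (recAge, Gender) rows from the Records table
records = [(1, "F"), (1.5, "M"), (2, "F"), (2.5, "M")]


def build_index(ref_rows):
    """Group reference rows by gender, each group sorted by refAge."""
    by_gender = {}
    for age, val, gender in ref_rows:
        by_gender.setdefault(gender, []).append((age, val))
    for rows in by_gender.values():
        rows.sort()
    return by_gender


def lookup(index, gender, rec_age):
    """Return refVal for the largest refAge <= rec_age + 1, or None if there is none."""
    rows = index.get(gender, [])
    ages = [age for age, _ in rows]
    pos = bisect_right(ages, rec_age + 1)
    return rows[pos - 1][1] if pos else None


index = build_index(reference)
result = [(gender, age, lookup(index, gender, age)) for age, gender in records]
for row in result:
    print(row)
```

With the sample data this reproduces the expected table: an exact match is used when recAge + 1 exists, and otherwise the closest smaller refAge is used.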

How to calculate over a time series in Pig

Let's say I write DUMP monthly; I get:
(Jan,2)
(Feb,102)
(Mar,250)
(Apr,450)
(May,590)
(Jun,790)
(Jul,1040)
(Aug,1260)
(Sep,1440)
(Oct,1770)
(Nov,2000)
(Dec,2500)
Checking schema:
DESCRIBE monthly;
Output:
monthly: {group: chararray,total_case: long}
I need to calculate increase rate for each month. So, for February, it will be:
(total_case in Feb - total_case in Jan) / total_case in Jan = (102 - 2) / 2 = 50
For March it will be: (250 - 102) / 102 = 1.45098039
So, if I put the records in monthlyIncrease, by writing DUMP monthlyIncrease, I will get:
(Jan,0)
(Feb,50)
(Mar,1.45098039)
........
........
(Dec, 0.25)
Is it possible in pig? I can't think of any way to do this.
It is possible. Create a second, identical relation, say b. Sort both relations by month, then rank both relations a and b. Join on a.rank = b.rank + 1 and do the calculation. You will have to union in the (Jan,0) record separately.
Assuming monthly is already sorted by month in calendar order:
monthly = LOAD '/test.txt' USING PigStorage('\t') as (a1:chararray,a2:int);
a = rank monthly;
b = rank monthly;
-- Pair each month (a) with the previous month (b).
c = join a by $0, b by ($0 + 1);
d = foreach c generate a::a1,(double)((a::a2 - b::a2)*1.0/(b::a2)*1.0);
-- The first row has no previous month, so add it with a rate of 0.
e = limit monthly 1;
f = foreach e generate $0,0.0;
g = UNION d,f;
dump g;
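As a cross-check of the arithmetic, here is a small Python sketch of the same month-over-month calculation. It assumes the rows are already in calendar order and that no previous total is zero (otherwise the division fails). The function name increase_rates is made up for this example.

```python
# (month, total_case) rows from DUMP monthly, in calendar order
monthly = [
    ("Jan", 2), ("Feb", 102), ("Mar", 250), ("Apr", 450),
    ("May", 590), ("Jun", 790), ("Jul", 1040), ("Aug", 1260),
    ("Sep", 1440), ("Oct", 1770), ("Nov", 2000), ("Dec", 2500),
]


def increase_rates(rows):
    """Return (month, rate) where rate = (current - previous) / previous.

    The first month has no previous value, so its rate is set to 0.0.
    """
    if not rows:
        return []
    rates = [(rows[0][0], 0.0)]
    # zip(rows, rows[1:]) pairs every month with the one that follows it.
    for (_, prev), (month, cur) in zip(rows, rows[1:]):
        rates.append((month, (cur - prev) / prev))
    return rates


rates = increase_rates(monthly)
for month, rate in rates:
    print(month, rate)
```

Pairing each row with its predecessor is the same idea as the join on a.rank = b.rank + 1 in the Pig script.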

PIG: How to create percentage (%) based table?

I am trying to create a table that shows how often each value occurs, as a percentage. For example, I have a table named example that contains this data:
class, value
------ -------
1 , abc
1 , abc
1 , xyz
1 , abc
2 , xyz
2 , abc
Here, for the class value 1, 'abc' occurred 3 times and 'xyz' occurred only once out of total occurrence of 4 times. For class value 2, 'abc' and 'xyz' occurred once (out of total two times occurrence).
So, the output is:
class, %_of_abc, %_of_xyz
------ -------- --------
1 , 75 , 25
2 , 50 , 50
Any idea how to do it where both the column values are changing? I was thinking to do it using GROUP. But not sure if I group it by class value, how it could help me.
It is a little complex, but here is the solution:
grunt> Dump A;
(1,abc)
(1,abc)
(1,xyz)
(1,abc)
(2,xyz)
(2,abc)
grunt> B = Group A by class;
grunt> C = foreach B generate group as class:int, COUNT(A) as cnt;
grunt> D = Group A by (class,value);
grunt> E = foreach D generate FLATTEN(group), COUNT(A) as tot_cnt;
grunt> F = foreach E generate $0 as class:int, $1 as value:chararray, tot_cnt;
grunt> G = JOIN F BY class,C BY class;
grunt> H = foreach G generate $0 as class,$1 as value,($2*100/$4) as perc;
grunt> Dump H;
(1,xyz,25)
(1,abc,75)
(2,xyz,50)
(2,abc,50)
grunt> I = group H by class;
grunt> J = FOREACH I generate group as class, FLATTEN(BagToTuple(H.perc));
grunt> Dump J;
(1,75,25)
(2,50,50)
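The same counting can be sketched in Python to confirm the percentages. One difference is deliberate: the sketch orders the value columns by name, because a Pig bag is unordered, so the column order produced by BagToTuple should not be relied on. The function name percentages is made up for this example.

```python
from collections import Counter

# (class, value) rows from the example table
rows = [(1, "abc"), (1, "abc"), (1, "xyz"), (1, "abc"), (2, "xyz"), (2, "abc")]


def percentages(data):
    """Return {class: {value: percent of rows in that class with that value}}."""
    totals = Counter(cls for cls, _ in data)   # rows per class
    pairs = Counter(data)                      # rows per (class, value)
    values = sorted({value for _, value in data})
    return {
        cls: {value: pairs[(cls, value)] * 100 / totals[cls] for value in values}
        for cls in sorted(totals)
    }


result = percentages(rows)
for cls, row in result.items():
    print(cls, row)
```

A value that never occurs in a class shows up as 0.0 here, whereas the join-based script simply has no row for it.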

Multiple ORDER by on Desc in pig

I would like to get the latest date for each cid, and also the latest amount on that date.
For the latest date I implemented the following:
A = LOAD '$input' AS (cid:chararray, date:chararray, amt:chararray,tid:chararray, time:chararray);
B = FOREACH (GROUP A BY (cid,tid)) {
    sort = ORDER A BY date DESC;
    latest = LIMIT sort 1;
    GENERATE FLATTEN(latest);
};
But I also want the latest amount. There are multiple records on the same date, so I tried to get the amount by ordering on time, like below.
AMT = FOREACH (GROUP B BY (cid,tid)) {
    sort1 = ORDER B BY time DESC;
    lastamt = LIMIT sort1 1;
    GENERATE FLATTEN(lastamt.amt);
};
Input:
9822736906^A2015-08-02^A146.08^A^A21:57:05.000000
9822736906^A2015-08-02^A250.12^A58926968^A22:45:30.000000
9822736906^A2015-08-02^A132.1^A00000000^A22:55:29.000000
9822736906^A2015-08-02^A60.97^A00000000^A23:02:48.000000
9826964132^A2015-08-05^A98.2^A^A23:05:46.000000
9822736906^A2015-08-05^A85.71^A4F7581^A23:12:22.000000
9822736906^A2015-08-05^A655.73^A00000000^A23:17:24.000000
Expected output:
9822736906^A2015-08-05^A655.73^A00000000^A23:17:24.000000
9826964132^A2015-08-05^A98.2^A^A23:05:46.000000
9822736906^A2015-08-02^A60.97^A00000000^A23:02:48.000000
If the objective is to select latest record for a cid then the below snippet will work.
Order by date and time in desc order in the same ORDER BY operator.
Input :
9822736906 2015-08-02 146.08 21:57:05.000000
9822736906 2015-08-02 250.12 58926968 22:45:30.000000
9822736906 2015-08-02 132.1 00000000 22:55:29.000000
9822736906 2015-08-02 60.97 00000000 23:02:48.000000
9826964132 2015-08-05 98.2 23:05:46.000000
9822736906 2015-08-05 85.71 4F7581 23:12:22.000000
9822736906 2015-08-05 655.73 00000000 23:17:24.000000
Pig script :
A = LOAD 'a.csv' USING PigStorage('\t') AS (cid:chararray, date:chararray, amt:chararray,tid:chararray, time:chararray);
B = GROUP A BY cid;
C = FOREACH B {
    sort = ORDER A BY date DESC, time DESC;
    latest = LIMIT sort 1;
    GENERATE FLATTEN(latest);
};
Output : DUMP C :
(9822736906,2015-08-05,655.73,00000000,23:17:24.000000)
(9826964132,2015-08-05,98.2,,23:05:46.000000)
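The effect of ordering on date and time together can be illustrated with a short Python sketch. It relies on the assumption that dates are in YYYY-MM-DD form and times are zero-padded, so plain string comparison matches chronological order. The function name latest_per_cid is made up for this example.

```python
# (cid, date, amt, tid, time) rows from the input above; tid may be empty
rows = [
    ("9822736906", "2015-08-02", "146.08", "", "21:57:05.000000"),
    ("9822736906", "2015-08-02", "250.12", "58926968", "22:45:30.000000"),
    ("9822736906", "2015-08-02", "132.1", "00000000", "22:55:29.000000"),
    ("9822736906", "2015-08-02", "60.97", "00000000", "23:02:48.000000"),
    ("9826964132", "2015-08-05", "98.2", "", "23:05:46.000000"),
    ("9822736906", "2015-08-05", "85.71", "4F7581", "23:12:22.000000"),
    ("9822736906", "2015-08-05", "655.73", "00000000", "23:17:24.000000"),
]


def latest_per_cid(data):
    """Return {cid: row} keeping the row with the greatest (date, time) per cid."""
    latest = {}
    for row in data:
        cid, date, _amt, _tid, time = row
        current = latest.get(cid)
        # Compare date first and time second, like ORDER BY date DESC, time DESC.
        if current is None or (date, time) > (current[1], current[4]):
            latest[cid] = row
    return latest


result = latest_per_cid(rows)
for row in result.values():
    print(row)
```

Sorting on date alone leaves the order of same-day records undefined, which is why time has to be part of the same sort key.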

How to sort BagToTuple after group with pig?

I have the following data set:
id company
1 a
1 b
2 c
2 a
I wrote the code as following:
record = load....
grp = GROUP record BY id;
newdata = FOREACH grp GENERATE group AS id,
COUNT(record) AS counts,
BagToTuple(record.company) AS company;
The output looks like:
id count company
1 2 a,b
2 2 c,a
But I would like the companies to be sorted. For example, I need a,c for id 2.
Use a nested FOREACH:
newdata = FOREACH grp {
    sortedbag = order record by company;
    GENERATE group AS id,
             COUNT(sortedbag) AS counts,
             BagToTuple(sortedbag.company) AS company;
};
Here the sortedbag alias holds the records sorted by company in ascending order. To sort in descending order, change the statement to:
sortedbag = order record by company DESC;
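For comparison, the expected result can be reproduced with a few lines of Python. This only illustrates the sort-within-group step; the function name companies_by_id and its descending flag are made up for this example.

```python
# (id, company) rows from the question
records = [(1, "a"), (1, "b"), (2, "c"), (2, "a")]


def companies_by_id(rows, descending=False):
    """Return {id: (count, sorted tuple of companies)} for each id."""
    groups = {}
    for rid, company in rows:
        groups.setdefault(rid, []).append(company)
    # Sort inside each group before turning the list into a tuple.
    return {
        rid: (len(companies), tuple(sorted(companies, reverse=descending)))
        for rid, companies in groups.items()
    }


result = companies_by_id(records)
for rid, (count, companies) in result.items():
    print(rid, count, ",".join(companies))
```

Passing descending=True corresponds to the DESC variant of the ORDER statement.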