solve cumulative data in Pig Latin - apache-pig

I am getting cumulative readings from a meter.
Example:-
Date KWH
2018-12-01 50
2018-12-02 90
2018-12-03 150
I want to extract the actual KWH consumed each day (the difference from the previous day's reading) using Pig.
Expected:-
Date KWH
2018-12-02 40
2018-12-03 60

Referring to the previous record is hard in Hadoop because the input is split and the splits are assigned to different tasks. I think the following self-join would work, but it is inefficient compared to a single process reading the data sequentially.
-- Default loader expects tab-separated fields; use PigStorage(' ') if the file is space-separated
A = LOAD 'test.txt' AS (a1:chararray, a2:int);
B = FOREACH A GENERATE ToDate(a1, 'y-M-d', 'UTC') as date, a2;
-- Shift every reading forward one day and negate it
C = FOREACH B GENERATE AddDuration(date, 'P1D') as nextdate, -a2 as a2;
-- Pair each day with the previous day's (negated) reading
D = join B by date, C by nextdate;
E = FOREACH D GENERATE B::date as date, B::a2 + C::a2 as value;
dump E;
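To sanity-check the self-join on a small sample outside the cluster, the same logic can be sketched in plain Python. This is only an illustration: the function name `daily_usage` and the dictionary layout are assumptions, not part of the original answer.

```python
from datetime import date, timedelta

def daily_usage(readings):
    """readings maps a date to the cumulative KWH reading on that date."""
    usage = {}
    for day, total in readings.items():
        # Look up the reading from exactly one day earlier (the join key in the Pig script)
        previous = readings.get(day - timedelta(days=1))
        if previous is not None:
            usage[day] = total - previous
    return usage

readings = {
    date(2018, 12, 1): 50,
    date(2018, 12, 2): 90,
    date(2018, 12, 3): 150,
}
for day, kwh in sorted(daily_usage(readings).items()):
    print(day, kwh)
```

As in the Pig version, a day whose predecessor is missing produces no row, so the first date drops out.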

Related

Is there a way to count the individual instances of an event per year?

I am working with Apache Pig to learn how to handle large datasets. The specific problem: for every year in the dataset, I need to count the number of days on which the recorded maximum temperature was above 80 degrees.
The data is set up in the following manner.
Date Max Temp
1919-06-03, 36
1919-11-26, 91
1927-09-23, 61
This repeats every day for about 200 years.
Currently, I know that to make this more manageable I will be using the split function, to split the data set based on the temp being above 80 degrees.
SPLIT data INTO max_above_95 if max_t > 80;
I also figured that if I can get the year out of the date, I can group by it after splitting and count to get the intended results.
However, I could not find a way to extract the year part of the date.
In the end I need output listing each year and the number of occurrences for that year, such as the following:
(1993, 21)
(1994, 7)
(1995, 13)
Use FILTER, then extract the year, group by year, and count the occurrences.
-- Assumes A has fields (Date:chararray, max_t:int); if Date is already a datetime, call GetYear(Date) directly
B = FILTER A BY max_t > 80;
C = FOREACH B GENERATE GetYear(ToDate(Date, 'yyyy-MM-dd')) AS Year, max_t;
D = GROUP C BY Year;
E = FOREACH D GENERATE group AS Year, COUNT(C) AS days;
DUMP E;
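The filter, extract-year, group, and count steps can be checked against a few rows with a short Python sketch. The function name `hot_days_per_year` and the `threshold` parameter are hypothetical names chosen for this illustration.

```python
from collections import Counter

def hot_days_per_year(rows, threshold=80):
    """rows is a list of (date_string 'YYYY-MM-DD', max_temp) pairs."""
    counts = Counter()
    for day, max_t in rows:
        if max_t > threshold:          # FILTER ... BY max_t > 80
            year = int(day[:4])        # GetYear(...)
            counts[year] += 1          # GROUP BY Year + COUNT
    return dict(counts)

rows = [("1919-06-03", 36), ("1919-11-26", 91), ("1927-09-23", 61)]
print(hot_days_per_year(rows))
```

Years with no day above the threshold do not appear in the result, which matches what the GROUP in the Pig script produces.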

Find continuity of elements in Pig

How can I find the run length (continuity) of a field and the starting position of each run?
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output I want is
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
B = FOREACH (GROUP A BY value) GENERATE
group as value,
COUNT(A) as continuous_counts,
MIN(A.index) as start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data can be discontinuous, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
Group and count the number of values for continuous_counts, i.e.
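For the discontinuous case, the computation such a UDF would need to perform is a run-length pass over the rows in index order. A minimal Python sketch of that logic follows; the function name `runs` is hypothetical, and this is not a Pig UDF, only an illustration of the algorithm.

```python
from itertools import groupby

def runs(rows):
    """rows is a list of (value, index) pairs already sorted by index.

    Returns one (value, run_length, start_index) tuple per consecutive run.
    """
    result = []
    # groupby only merges adjacent equal keys, so a value that reappears
    # later starts a new run instead of being folded into the first one
    for value, group in groupby(rows, key=lambda row: row[0]):
        members = list(group)
        result.append((value, len(members), members[0][1]))
    return result

rows = [("A", 1), ("B", 2), ("B", 3), ("B", 4), ("C", 5), ("C", 6)]
print(runs(rows))
```

On the sample input this gives the same rows as the GROUP-based script; the two approaches differ only when a value reappears after a gap.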
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray, index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group as value,COUNT(A.value) as continuous_counts;
-- Inside the nested block, sort the grouped bag A (not the relation B) and keep its first row
D = FOREACH B {
ordered = ORDER A BY index;
first = LIMIT ordered 1;
GENERATE group as value, FLATTEN(first.index) as index;
}
E = JOIN C BY value,D BY value;
F = FOREACH E GENERATE C::value,C::continuous_counts,D::index;
DUMP F;

Calculating percentage using PIG latin

I have a table with two columns (code:chararray, sp:double)
I want to calculate the percentage of every sp.
INPUT
t001 60
a002 75
a003 34
bb04 56
bbc5 23
cc2c 45
ddc5 45
desired OUTPUT:
code Perc
t001 17%
a002 22%
a003 10%
bb04 16.5%
bbc5 6%
cc2c 13.3%
ddc5 13.3%
I tried the following, but it does not produce any output.
A = load '....' as (code : chararray, sp : double);
B = GROUP A BY (code);
allcount = FOREACH B GENERATE SUM(A.speed) as total;
perc = FOREACH A GENERATE code,speed/(double)allcount.total * 100;
dump perc;
How can I do this using Pig Latin?
You are loading the second column into a field called sp but referring to it as speed. I presume that the columns are separated by a single white space; if they are separated by a tab, use PigStorage('\t') in the LOAD statement.
A = LOAD '/YourFilePath/YourFile.txt' USING PigStorage(' ') AS (code:chararray, sp:double);
B = GROUP A ALL;
C = FOREACH B GENERATE SUM(A.sp) AS total;
D = FOREACH A GENERATE code,ROUND_TO((sp/(double)C.total) * 100,2) AS perc;
E = FOREACH D GENERATE code,CONCAT((chararray)perc,'%');
DUMP E;
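As a quick cross-check of what the script should print for the sample input, here is the same calculation in plain Python (the function name `percentages` is hypothetical). The total of the sp column is 338, so the rounded values differ slightly from the approximate figures in the question.

```python
def percentages(rows):
    """rows is a list of (code, sp) pairs; returns (code, percent) rounded to 2 places."""
    total = sum(sp for _, sp in rows)   # GROUP A ALL + SUM(A.sp)
    return [(code, round(sp / total * 100, 2)) for code, sp in rows]

rows = [
    ("t001", 60.0), ("a002", 75.0), ("a003", 34.0), ("bb04", 56.0),
    ("bbc5", 23.0), ("cc2c", 45.0), ("ddc5", 45.0),
]
for code, perc in percentages(rows):
    print(code, f"{perc}%")
```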

PIG Script to handle nth-1 record

Input file structure: records are sorted by timestamp.
Expected input file size: 2-3 TB
timestamp
==============
20141014120523
20141014120534
20141014120537
20141014120542
20141014120549
20141014120555
20141014120565
20141014120570
20141014120512
...
...
Using PIG I need to find the time difference between the Nth record and Nth-1 Record time stamp (20141014120534 - 20141014120523 = 11 secs).
I need to loop through all the records to get the time difference from previous record
Example Output
0
11
3
5
...
Please help me with the right resources/references/solutions.
Can you try this?
input.txt
20141014120523
20141014120534
20141014120537
20141014120542
20141014120549
20141014120555
20141014120565
20141014120570
PigScript:
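A = LOAD 'input.txt' using PigStorage() as (time:long);
-- RANK prepends a sequential row number in a field named rank_A
B = RANK A;
-- Drop the first row, then shift each remaining rank down by one so it lines up with its predecessor
D = FILTER B BY rank_A > 1;
E = FOREACH D GENERATE ($0-1),$1;
F = JOIN B BY $0, E BY $0;
-- Plain numeric subtraction: this equals seconds only while both timestamps fall within the same minute
G = FOREACH F GENERATE (E::time - B::time);
DUMP G;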
Output:
(11)
(3)
(5)
(7)
(6)
(10)
(5)
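The script above subtracts the timestamps as plain numbers, which is correct only within a single minute. A hedged Python sketch of the same previous-record difference that parses the timestamps first is shown below. The function name `gaps_in_seconds` is hypothetical, and the second sample (crossing a minute boundary) is an invented value used only to show the difference.

```python
from datetime import datetime

def gaps_in_seconds(stamps):
    """stamps is a list of 'YYYYMMDDhhmmss' strings in file order."""
    times = [datetime.strptime(s, "%Y%m%d%H%M%S") for s in stamps]
    if not times:
        return []
    # The first record has no predecessor, so its gap is reported as 0
    gaps = [0]
    for previous, current in zip(times, times[1:]):
        gaps.append(int((current - previous).total_seconds()))
    return gaps

print(gaps_in_seconds(["20141014120523", "20141014120534",
                       "20141014120537", "20141014120542"]))
# Hypothetical pair crossing a minute boundary: raw subtraction would give 50, not 10
print(gaps_in_seconds(["20141014120555", "20141014120605"]))
```

In Pig, the equivalent would be to convert the field to a datetime before subtracting, rather than subtracting the raw long values.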

Convert data in a specific format in Apache Pig.

I want to convert data into a specific format in Apache Pig so that I can use a reporting tool on top of it.
For example:
10:00,abc
10:00,cde
10:01,abc
10:01,abc
10:02,def
10:03,efg
The output should be in the following format:
abc cde def efg
10:00 1 1 0 0
10:01 2 0 0 0
10:02 0 0 1 0
The main problem here is that a value can occur multiple times in a row, depending on the different values available in the sample csv file, up to a total of 120.
Any suggestions to tackle this are more than welcome.
Thanks
Gagan
Try something like the following:
A = load 'data' using PigStorage(',') as (key:chararray,value:chararray);
-- One 0/1 indicator column per distinct value
B = foreach A generate key,(value=='abc'?1:0) as abc,(value=='cde'?1:0) as cde,(value=='def'?1:0) as def,(value=='efg'?1:0) as efg;
C = group B by key;
-- SUM adds up the indicators; COUNT would also count the rows where the indicator is 0
D = foreach C generate group as key, SUM(B.abc) as abc, SUM(B.cde) as cde, SUM(B.def) as def, SUM(B.efg) as efg;
That should get you a count of the occurrences of a particular value for a particular key.
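EDIT: just noticed the 120 limit in the question. If a count must not be reported above 120, add the following (both branches of the conditional must have the same type, hence the cast):
E = foreach D generate key,(abc>120?'OVER 120':(chararray)abc) as abc,(cde>120?'OVER 120':(chararray)cde) as cde,(def>120?'OVER 120':(chararray)def) as def,(efg>120?'OVER 120':(chararray)efg) as efg;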
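The script above hardcodes one column per value, which gets unwieldy with up to 120 distinct values. As an illustration of the target table only, here is a small Python sketch that discovers the columns from the data. The function name `pivot` is hypothetical, and this is a cross-check of the expected output, not a replacement for the Pig script.

```python
from collections import Counter

def pivot(rows):
    """rows is a list of (key, value) pairs; returns (columns, {key: counts per column})."""
    columns = sorted({value for _, value in rows})   # distinct values become column headers
    keys = sorted({key for key, _ in rows})
    counts = Counter(rows)                           # occurrences of each (key, value) pair
    table = {key: [counts[(key, column)] for column in columns] for key in keys}
    return columns, table

rows = [("10:00", "abc"), ("10:00", "cde"), ("10:01", "abc"),
        ("10:01", "abc"), ("10:02", "def"), ("10:03", "efg")]
columns, table = pivot(rows)
print("      " + " ".join(columns))
for key, row in table.items():
    print(key, " ".join(f"{n:>3}" for n in row))
```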