Lagging/Differences in Pig - apache-pig

Is there a way to generate "column" lags or differences in Pig? Here's an example of what I'm trying to do, transcribed from R:
Value Value.Diff
1 0.209 NA
2 0.198 -0.011
3 0.187 -0.011
4 0.176 -0.011
5 0.168 -0.008
6 0.159 -0.009
I realize this might be tricky given the (presumably) distributed nature of Pig's data storage, but thought it might be possible given that Pig 0.11+ allows you to rank tuples.

Something like this should work:
ranked = rank values by some_field;
-- RANK prepends a 1-based rank as $0
with_rank = foreach ranked generate $0 as this_rank, $0 - 1 as prev_rank, value;
copy = foreach with_rank generate *;
-- pair each row with the row ranked immediately before it
pairs = join with_rank by prev_rank, copy by this_rank;
diffs = foreach pairs generate with_rank::this_rank, with_rank::value - copy::value as diff;
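Note that with an inner join the rank-1 row has no predecessor and drops out of pairs, which corresponds to the NA in the first row of the R output; a left outer join on with_rank would keep it with a null diff.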

Related

SQL Query to return which columns have different values given two rows

I have one table like this:
id  status       time  days  ...
1   optimal      60    21
2   optimal      50    21
3   no solution  60    30
4   optimal      21    31
5   no solution  34    12
...
There are many more rows and columns.
I need to make a query that will return which columns have different information, given two IDs.
Rephrasing it: I'll provide two IDs, for example 1 and 5, and I need to know whether these two rows have any columns with different values. In this case, the result should be something like:
id status time days
1 optimal 60 21
5 no solution 34 12
If I provide IDs 1 and 2, for example, the result should be:
id time
1 60
2 50
The output format doesn't need to match this exactly; it only needs to show clearly which columns differ and what their values are.
I can tell you off the bat that processing this data in a programming language will be much simpler and more readable for this kind of problem, but here is a thread on how it can be done in SQL:
Compare two rows and identify columns whose values are different
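For a flavor of the pure-SQL route, here is a minimal sketch; the table name results and the fixed column list are assumptions, and every column has to be compared by hand, which is exactly why a programming language is recommended:
-- assumes a table results(id, status, time, days); a CASE without ELSE
-- yields NULL, so each *_diff column is non-NULL exactly when the two
-- rows disagree on that column
SELECT
  CASE WHEN a.status <> b.status THEN 'status' END AS status_diff,
  CASE WHEN a.time   <> b.time   THEN 'time'   END AS time_diff,
  CASE WHEN a.days   <> b.days   THEN 'days'   END AS days_diff
FROM results a
JOIN results b ON a.id = 1 AND b.id = 5;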
If you are looking for a solution in R, here is mine:
df <- read.csv(file = "sf.csv", header = TRUE)

diff.eval <- function(first.id, second.id, eval.df) {
  res <- eval.df[c(first.id, second.id), ]
  cols <- colnames(eval.df)
  for (col in cols) {
    # drop every column on which the two rows agree
    if (res[1, col] == res[2, col]) {
      res[, col] <- NULL
    }
  }
  return(res)
}

print(diff.eval(1, 5, df))
print(diff.eval(1, 2, df))
You just need to create a data frame out of the table. For convenience, I created a local .csv and imported the data into a data frame.

Split column in Hive

I am new to Hive and the Hadoop framework. I am trying to write a Hive query to split a column delimited by a pipe '|' character, and then pair up adjacent values and put each pair in its own row.
Example, I have a table
id mapper
1 a|0.1|b|0.2
2 c|0.2|d|0.3|e|0.6
3 f|0.6
I am able to split the column by using split(mapper, "\\|") which gives me the array
id mapper
1 [a,0.1,b,0.2]
2 [c,0.2,d,0.3,e,0.6]
3 [f,0.6]
Now I tried to use a lateral view to split the mapper array into separate rows, but it separates every value, whereas I want them separated by pair.
Expected:
id mapper
1 [a,0.1]
1 [b,0.2]
2 [c,0.2]
2 [d,0.3]
2 [e,0.6]
3 [f,0.6]
Actual:
id mapper
1 a
1 0.1
1 b
1 0.2
etc .......
How can I achieve this?
I would suggest splitting your pairs with split(mapper, '(?<=\\d)\\|(?=\\w)'), e.g.
split('c|0.2|d|0.3|e|0.6', '(?<=\\d)\\|(?=\\w)')
results in
["c|0.2","d|0.3","e|0.6"]
then explode the resulting array and split by |.
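Put together, a minimal sketch (the table name t is an assumption; the regexes are the ones from this answer):
-- split into "key|value" pairs, explode one pair per row,
-- then split each pair into a 2-element array
SELECT id, split(pair, '\\|') AS pair_arr
FROM t
LATERAL VIEW explode(split(mapper, '(?<=\\d)\\|(?=\\w)')) p AS pair;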
Update:
If the keys can be digits as well, and your float numbers have only one digit after the decimal point, then the regex should be extended to split(mapper, '(?<=\\.\\d)\\|(?=\\w|\\d)').
Update 2:
OK, the best way is to split on every second | as follows
split(mapper, '(?<!\\G[^\\|]+)\\|')
e.g.
split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\\G[^\\|]+)\\|')
results in
["6193439|0.0444035224643987","6186654|0.0444035224643987"]

I have 50 fields. Is there any option in Apache Pig to print the first 40 fields, something like the range $0-$39?

I have 50 fields. Is there any option in Pig to print the first 40 fields? I require something like the range $0-$39.
I don't want to specify each and every field like $0, $1, $2, etc.
Listing every column is acceptable when there are only a few, but what do you do when there is a huge number of columns?
You can use the .. range notation.
First 40 fields:
B = FOREACH A GENERATE $0..$39;
All fields:
B = FOREACH A GENERATE $0..;
Multiple ranges, for example fields 1-10, 15-20, and 25-50:
B = FOREACH A GENERATE $0..$9,$14..$19,$24..;
Arbitrary fields, for example 22, 33-44, and 46:
B = FOREACH A GENERATE $21,$32..$43,$45;

Issue in Loading data from Movielens into pig

I'm trying to load some data into Pig:
Record:
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
Script Used:
loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
USING PigStorage(':')
AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);
Output:
11,,American President, The (1995),,Comedy|Drama|Romance
12,,Dracula,, Dead and Loving It (1995)
How do I tackle the colon (:) after "Dracula"?
Because of that colon, the second column gets split into two, and since there are three columns in total, the last column of movie 12 (Comedy|Horror) doesn't get loaded.
You can achieve this using REGEX_EXTRACT_ALL.
The following piece of code achieves this:
-- load each line whole, then pull the three '::'-delimited fields out with a regex
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
AS (f1:chararray);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE (long)$0 AS MovieID, $1 AS Title, $2 AS Genre;
DUMP D;
I got the following output (each record is a tuple); the ":" after "Dracula" is intact.
(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)
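As an aside, a similar sketch using the built-in STRSPLIT instead of a regex capture (the limit argument 3 means the line is split at most twice, so any further colons stay inside the title):
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat' AS (line:chararray);
-- STRSPLIT returns a tuple; FLATTEN turns it into three separate fields
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '::', 3))
    AS (MovieID:chararray, Title:chararray, Genre:chararray);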

Pig Script - Min, Avg, Max

Let us say I have these in a file ...
1
2
3
Using a Pig script, how can I get this (number, minimum, mean, maximum on each line)?
1,1,2,3
2,1,2,3
3,1,2,3
Please let me know the Pig script. I am able to get the MIN, AVG, and MAX using Pig's built-in functions, but I am not able to get them all on each line.
Thanks
Naga
Use the TOBAG built-in UDF to get your fields into a bag, and then you can use the MIN, AVG, and MAX functions on that bag. You should have no trouble using all three summary functions on a single record.
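A minimal sketch of that idea. Note that it assumes a record with several numeric fields (the file name and the field names f1, f2, f3 are hypothetical), since TOBAG collects fields within a single record:
-- hypothetical wide input: each record carries three numeric fields
A = LOAD 'wide_input.txt' AS (f1:double, f2:double, f3:double);
-- TOBAG wraps each field in a tuple, producing {(f1),(f2),(f3)},
-- the bag shape that MIN/AVG/MAX expect
B = FOREACH A GENERATE f1, f2, f3,
    MIN(TOBAG(f1, f2, f3)) AS min_val,
    AVG(TOBAG(f1, f2, f3)) AS avg_val,
    MAX(TOBAG(f1, f2, f3)) AS max_val;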
Here is my simple solution for the problem.
I had the following numbers as input,
temp2.txt
1
2
3
4
5
.
.
16
17
18
19
20
I followed these steps:
1] Loaded the data from the file.
2] Grouped all the data.
3] Computed the average, minimum and maximum of the grouped data.
4] For each value in the loaded data, generated the value together with the minimum, maximum and average.
The code is as follows,
grunt> data = load '/home/temp2.txt' as (val);
grunt> g = group data all;
grunt> avg = foreach g generate AVG(data.val) as a;
grunt> min = foreach g generate MIN(data.val) as m;
grunt> max = foreach g generate MAX(data.val) as x;
grunt> -- avg, min and max each hold exactly one tuple, so they can be
grunt> -- projected as scalars (min.m, max.x, avg.a) in another foreach
grunt> values = foreach data generate val, min.m, max.x, avg.a;
grunt> dump values;
Output (val was loaded without a declared type, so MIN, MAX and AVG cast it to double; that is why they print as 1.0, 20.0 and 10.5 while val itself prints without a decimal point):
(1,1.0,20.0,10.5)
(2,1.0,20.0,10.5)
(3,1.0,20.0,10.5)
(4,1.0,20.0,10.5)
(5,1.0,20.0,10.5)
(6,1.0,20.0,10.5)
(7,1.0,20.0,10.5)
(8,1.0,20.0,10.5)
(9,1.0,20.0,10.5)
(10,1.0,20.0,10.5)
(11,1.0,20.0,10.5)
(12,1.0,20.0,10.5)
(13,1.0,20.0,10.5)
(14,1.0,20.0,10.5)
(15,1.0,20.0,10.5)
(16,1.0,20.0,10.5)
(17,1.0,20.0,10.5)
(18,1.0,20.0,10.5)
(19,1.0,20.0,10.5)
(20,1.0,20.0,10.5)
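As a design note, the three aggregates can also be computed in a single pass over the grouped data and projected together; a minimal sketch:
grunt> stats = foreach g generate MIN(data.val) as m, MAX(data.val) as x, AVG(data.val) as a;
grunt> values = foreach data generate val, stats.m, stats.x, stats.a;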