Generating binary variables in Pig

I am a newbie to the world of Pig and I need to implement the following scenario.
Problem:
Input to the Pig script: an arbitrary relation, say the table below
A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3
We have to generate binary columns based on B and C, so my output will look something like this.
Output:
A B C B.b1 B.b2 C.c1 C.c2 C.c3
a1 b1 c1 1 0 1 0 0
a2 b2 c2 0 1 0 1 0
a1 b1 c3 1 0 0 0 1
Can someone let me know how to achieve this in Pig? I know this can easily be achieved with an R script, but my requirement is to do it in Pig.
Your help will be highly appreciated.

Can you try this?
input
a1 b1 c1
a2 b2 c2
a1 b1 c3
Pig script:
X = LOAD 'input' USING PigStorage() AS (A:chararray,B:chararray,C:chararray);
Y = FOREACH X GENERATE A,B,C,
((B=='b1')?1:0) AS Bb1,
((B=='b2')?1:0) AS Bb2,
((C=='c1')?1:0) AS Cc1,
((C=='c2')?1:0) AS Cc2,
((C=='c3')?1:0) AS Cc3;
DUMP Y;
Output:
(a1,b1,c1,1,0,1,0,0)
(a2,b2,c2,0,1,0,1,0)
(a1,b1,c3,1,0,0,0,1)
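The script above hardcodes the distinct values b1/b2 and c1/c2/c3. If the relation is truly arbitrary and the values are not known up front, one option is to generate the FOREACH statement programmatically before running the job. A minimal sketch in Python (the local file name 'input' and the three-column layout are assumptions matching the sample data above):
# Hypothetical helper: emit the FOREACH...GENERATE statement from the
# distinct values of B and C found in a local copy of the data.
rows = [line.split() for line in open('input') if line.strip()]
b_values = sorted({row[1] for row in rows})
c_values = sorted({row[2] for row in rows})
parts = ["Y = FOREACH X GENERATE A,B,C"]
parts += ["    ((B=='{0}')?1:0) AS B{0}".format(v) for v in b_values]
parts += ["    ((C=='{0}')?1:0) AS C{0}".format(v) for v in c_values]
print(',\n'.join(parts) + ';')
The printed statement can then be pasted into the Pig script (or injected via parameter substitution).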

Related

How to filter by day with pandas

I am reading a series of data from a file via pd.read_csv().
From it I create a DataFrame like the following:
col1 col2
01/01/2001 a1 a2
02/01/2001 b1 b2
03/01/2001 c1 c2
04/01/2001 d1 d2
01/01/2002 e1 e2
02/01/2002 f1 d2
03/01/2002 g1 g2
04/01/2002 h1 h2
What I would like to do is group by the same day and assign a value to it, i.e.:
col1
01/01 ax
02/01 bx
03/01 cx
04/01 dx
Does anyone have any clue how to do this cleanly?
Thanks a lot in advance.
LS
The first thing I'd do is make sure your index holds dates. If you know it does, skip this step.
df.index = pd.to_datetime(df.index)
Then group by something like [df.index.month, df.index.day] or df.index.strftime('%m-%d'). However, you have to choose whether to aggregate or transform. You didn't specify which you wanted, so I chose first as the aggregation function.
df.groupby(df.index.strftime('%m-%d')).first()
col1 col2
01-01 a1 a2
02-01 b1 b2
03-01 c1 c2
04-01 d1 d2
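For completeness, a self-contained sketch of both options on data shaped like the example (the column values and the choice of first are placeholders; pick whichever reduction fits your data):
import pandas as pd
df = pd.DataFrame(
    {'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1', 'g1', 'h1'],
     'col2': ['a2', 'b2', 'c2', 'd2', 'e2', 'f2', 'g2', 'h2']},
    index=['01/01/2001', '02/01/2001', '03/01/2001', '04/01/2001',
           '01/01/2002', '02/01/2002', '03/01/2002', '04/01/2002'])
# Parse the index as dates (pandas defaults to month-first here;
# pass dayfirst=True to pd.to_datetime if these are day-first dates).
df.index = pd.to_datetime(df.index)
key = df.index.strftime('%m-%d')
# Aggregate: one row per calendar day, keeping the first value seen.
agg = df.groupby(key).first()
print(agg)
# Transform: keeps the original shape, broadcasting the group value back.
df['col1_first'] = df.groupby(key)['col1'].transform('first')
print(df)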

Compare Pandas dataframes and add column

I have two dataframes as below:
df1        df2
A          A   C
A1         A1  C1
A2         A2  C2
A3         A3  C3
A1         A4  C4
A2
A3
A4
Each value of column 'A' in df1 has a corresponding value defined in column 'C' of df2.
I want to add a new column 'B' to df1, taking its values from df2's column 'C'.
The final df1 should look like this:
df1
A B
A1 C1
A2 C2
A3 C3
A1 C1
A2 C2
A3 C3
A4 C4
I can loop over df2 and add the values to df1, but it's time-consuming as the data is huge.
for index, row in df2.iterrows():
    df1.loc[df1.A.isin([row['A']]), 'B'] = row['C']
Can someone help me understand how I can solve this without looping over df2?
Thanks
You can use map with a Series:
df1['B'] = df1.A.map(df2.set_index('A')['C'])
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
It is the same as map with a dict:
d = df2.set_index('A')['C'].to_dict()
print (d)
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'}
df1['B'] = df1.A.map(d)
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
Timings:
len(df1)=7:
In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
1000 loops, best of 3: 1.73 ms per loop
In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 873 µs per loop
len(df1)=70k:
In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
100 loops, best of 3: 12.8 ms per loop
In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
100 loops, best of 3: 6.05 ms per loop
IIUC, you can just merge and rename the column:
df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
In [103]:
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']})
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C3','C4']})
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
merged
Out[103]:
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
Based on the searchsorted method, here are three approaches with different indexing schemes:
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True)
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
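Note that searchsorted only lines up the right rows when df2['A'] is sorted, as it is in the example. A minimal self-contained sketch of the first variant under that assumption:
import pandas as pd
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4']})
df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                    'C': ['C1', 'C2', 'C3', 'C4']})
# df2.A must be sorted for searchsorted to return the matching positions.
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
print(df1)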

Split a value in a data.frame and create an additional row to store its components

In R, I have a data frame called df such as the following:
A B C D
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5 - 7
a4 b4 c4 2.5
I want to split the value in the third row of column D on the dash and create another row for the second value, retaining the other values of that row.
So I want this:
A B C D
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5
a3 b3 c3 7
a4 b4 c4 2.5
Any idea how this can be achieved?
Ideally, I would also want to create an extra column specifying whether the value I split off is a minimum or a maximum.
So this:
A B C D E
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5 min
a3 b3 c3 7 max
a4 b4 c4 2.5
Thanks.
One option would be to use sub to paste 'min' and 'max' into the 'D' column where a - is found, and then use cSplit to split the 'D' column.
library(splitstackshape)
df1$D <- sub('(\\d+) - (\\d+)', '\\1,min - \\2,max', df1$D)
res <- cSplit(cSplit(df1, 'D', ' - ', 'long'), 'D', ',')[is.na(D_2), D_2 := '']
setnames(res, 4:5, LETTERS[4:5])
res
# A B C D E
#1: a1 b1 c1 2.5
#2: a2 b2 c2 3.5
#3: a3 b3 c3 5.0 min
#4: a3 b3 c3 7.0 max
#5: a4 b4 c4 2.5
Here's a dplyrish way:
DF %>%
  group_by(A, B, C) %>%
  do(data.frame(D = as.numeric(strsplit(as.character(.$D), " - ")[[1]]))) %>%
  mutate(E = if (n() == 2) c("min", "max") else "")
A B C D E
(fctr) (fctr) (fctr) (dbl) (chr)
1 a1 b1 c1 2.5
2 a2 b2 c2 3.5
3 a3 b3 c3 5.0 min
4 a3 b3 c3 7.0 max
5 a4 b4 c4 2.5
Dplyr has a policy against expanding rows, as far as I can tell, so the ugly
do(data.frame(... .$ ...))
construct is required. If you are open to data.table, it's arguably simpler here:
library(data.table)
setDT(DF)[, {
  D = as.numeric(strsplit(as.character(D), " - ")[[1]])
  list(D = D, E = if (length(D) == 2) c("min", "max") else "")
}, by = .(A, B, C)]
A B C D E
1: a1 b1 c1 2.5
2: a2 b2 c2 3.5
3: a3 b3 c3 5.0 min
4: a3 b3 c3 7.0 max
5: a4 b4 c4 2.5
We can use tidyr::separate_rows. I altered the input to include a negative value to make it more general:
df <- read.table(header=TRUE,stringsAsFactors=FALSE,text=
"A B C D
a1 b1 c1 -2.5
a2 b2 c2 3.5
a3 b3 c3 '5 - 7'
a4 b4 c4 2.5")
library(dplyr)
library(tidyr)
df %>%
  mutate(E = "", E = replace(E, grepl("[^^]-", D), "min - max")) %>%
  separate_rows(D, E, sep = "[^^]-", convert = TRUE)
#> A B C D E
#> 1 a1 b1 c1 -2.5
#> 2 a2 b2 c2 3.5
#> 3 a3 b3 c3 5.0 min
#> 4 a3 b3 c3 7.0 max
#> 5 a4 b4 c4 2.5

Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example
ID A B
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically, An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. I also tried individual groupbys on ID, A and ID, B. Maybe there is a way to pre-group by ID first and then handle all the other variables? (There are many variables and I have very many rows!)
Also I tried individual groupby for ID, A and ID, B
I think this is a straightforward way to tackle it; as you suggest, you can group by each column separately and compute the size of the groups, then use transform so you can easily add the results back to the original dataframe:
import numpy as np
df['An'] = df.groupby(['ID','A'])['A'].transform(np.size)
df['Bn'] = df.groupby(['ID','B'])['B'].transform(np.size)
print df
ID A B An Bn
0 i1 a1 b1 2 1
1 i1 a1 b2 2 2
2 i1 a2 b2 1 2
3 i2 a1 b2 1 1
Of course, with lots of columns you could do:
for col in ['A','B']:
    df[col + 'n'] = df.groupby(['ID',col])[col].transform(np.size)
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A','B']:
    df[col + 'n'] = df.duplicated(['ID',col])
print df
ID A B An Bn
0 i1 a1 b1 False False
1 i1 a1 b2 True False
2 i1 a2 b2 False True
3 i2 a1 b2 False False
EDIT: increasing performance for large data. I ran this on a large dataset (4 million rows), and it was significantly faster if I avoided transform and used something like the following (it is much less elegant):
for col in ['A','B']:
    x = df.groupby(['ID',col]).size()
    df.set_index(['ID',col],inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)
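As a side note, newer pandas versions also accept the string 'size' in transform, which uses pandas' built-in group-size computation; whether it beats the set_index trick above depends on your data, so benchmark both. A small self-contained sketch on the example data:
import pandas as pd
df = pd.DataFrame({'ID': ['i1', 'i1', 'i1', 'i2'],
                   'A':  ['a1', 'a1', 'a2', 'a1'],
                   'B':  ['b1', 'b2', 'b2', 'b2']})
# 'size' as a string dispatches to the built-in group-size path.
for col in ['A', 'B']:
    df[col + 'n'] = df.groupby(['ID', col])[col].transform('size')
print(df)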

Pig: order by with rank and join the rank together

I have the following data with the schema (t0:chararray, t1:int)
a0 1
a1 7
b2 9
a2 4
b0 6
And I want to order it by t1 and then combine it with a rank:
a0 1 1
a2 4 2
b0 6 3
a1 7 4
b2 9 5
Is there any convenient way to do this without writing a UDF in Pig?
There is the RANK operation in Pig. This should be sufficient:
A = LOAD 'input' AS (t0:chararray, t1:int);
X = RANK A BY t1 ASC;
Note that RANK prepends the rank as the first field of each tuple; please see the Pig docs on the RANK operator for more details.