How to filter by day with pandas

I am reading a series of data from a file via pd.read_csv().
From it I create a dataframe like the following:
            col1 col2
01/01/2001  a1   a2
02/01/2001  b1   b2
03/01/2001  c1   c2
04/01/2001  d1   d2
01/01/2002  e1   e2
02/01/2002  f1   f2
03/01/2002  g1   g2
04/01/2002  h1   h2
What I would like to do is group rows with the same day and month, and assign a value to each group, I mean:
        col1
01/01   ax
02/01   bx
03/01   cx
04/01   dx
Does anyone have any clues how to perform this smoothly?
Thanks a lot in advance.
LS

The first thing I'd do is make sure your index holds dates. If you know it does, skip this step.
df.index = pd.to_datetime(df.index)
Then group by something like [df.index.month, df.index.day] or df.index.strftime('%m-%d'). You also have to choose whether to aggregate or transform; you didn't specify which you wanted, so I aggregate with first():
df.groupby(df.index.strftime('%m-%d')).first()
       col1 col2
01-01  a1   a2
02-01  b1   b2
03-01  c1   c2
04-01  d1   d2
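For completeness, here is a minimal, self-contained sketch of the [df.index.month, df.index.day] variant mentioned above. The toy data is assumed from the question, and the date strings are assumed to parse month-first, which matches the 01-01 / 02-01 groups in the output shown:

import pandas as pd

# Assumption: month-first dates, consistent with the answer's '%m-%d' output.
idx = pd.to_datetime(['01/01/2001', '02/01/2001', '01/01/2002', '02/01/2002'],
                     format='%m/%d/%Y')
df = pd.DataFrame({'col1': ['a1', 'b1', 'e1', 'f1'],
                   'col2': ['a2', 'b2', 'e2', 'f2']}, index=idx)

# Grouping by a list of index attributes yields a (month, day) MultiIndex:
print(df.groupby([df.index.month, df.index.day]).first())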

Related

Finding max row after groupby in pandas dataframe

I have a dataframe as follows:
Month Col1 Col2 Val
A p a1 31
A q a1 78
A r b2 13
B x a1 54
B y b2 56
B z b2 65
I want to get the following:
Month a1 b2
A q r
B x z
Essentially, for each pair of Month and Col2, I want to find the value in Col1 which has the maximum Val.
I am not sure how to approach this.
Your problem has two parts:
find the row with the max Val within each group, which is sort_values plus drop_duplicates, and
reshape the data, which is pivot:
(df.sort_values('Val')
   .drop_duplicates(['Month', 'Col2'], keep='last')
   .pivot(index='Month', columns='Col2', values='Col1')
)
Output:
Col2 a1 b2
Month
A q r
B x z
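An alternative sketch of the same idea using idxmax instead of sort_values/drop_duplicates (assuming Val is numeric; data taken from the question):

import pandas as pd

df = pd.DataFrame({'Month': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Col1':  ['p', 'q', 'r', 'x', 'y', 'z'],
                   'Col2':  ['a1', 'a1', 'b2', 'a1', 'b2', 'b2'],
                   'Val':   [31, 78, 13, 54, 56, 65]})

# idxmax returns the row label of the maximum Val within each group:
winners = df.groupby(['Month', 'Col2'])['Val'].idxmax()
print(df.loc[winners].pivot(index='Month', columns='Col2', values='Col1'))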

How can I multiply each row by a number depending on the values of the row?

I have this DataFrame, and I'm looking to multiply the number of rows depending on the number of words in col3. Is this something that can be done in Python?
col1 col2 col3
A1 B1 a - ab - abc
A13 B13 a - ab
A27 B27 abcd
Desired output:
col1 col2 col3
A1 B1 a - ab - abc
A1 B1 a - ab - abc
A1 B1 a - ab - abc
A13 B13 a - ab
A13 B13 a - ab
A27 B27 abcd
Use Index.repeat with Series.str.count to count the words, then repeat the rows with DataFrame.loc:
df = df.loc[df.index.repeat(df['col3'].str.count(r'\w+'))].reset_index(drop=True)
print(df)
col1 col2 col3
0 A1 B1 a - ab - abc
1 A1 B1 a - ab - abc
2 A1 B1 a - ab - abc
3 A13 B13 a - ab
4 A13 B13 a - ab
5 A27 B27 abcd
If the words are always separated by -, it is possible to count the separators and add 1:
df = df.loc[df.index.repeat(df['col3'].str.count('-') + 1)].reset_index(drop=True)
Or the solution by @sammywemmy (thank you), splitting and taking the length of the resulting lists:
df.loc[df.index.repeat(df.col3.str.split('-').str.len())].reset_index(drop=True)
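Putting it together, a self-contained sketch of the Index.repeat approach, with the data assumed from the question:

import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A13', 'A27'],
                   'col2': ['B1', 'B13', 'B27'],
                   'col3': ['a - ab - abc', 'a - ab', 'abcd']})

# One repeat per word in col3, then rebuild a clean RangeIndex:
out = df.loc[df.index.repeat(df['col3'].str.count(r'\w+'))].reset_index(drop=True)
print(out)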

Split value from a data.frame and create additional row to store its component

In R, I have a data frame called df such as the following:
A B C D
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5 - 7
a4 b4 c4 2.5
I want to split the value in the third row of the D column on the dash and create another row for the second value, retaining the other values in that row.
So I want this:
A B C D
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5
a3 b3 c3 7
a4 b4 c4 2.5
Any idea how this can be achieved?
Ideally, I would also want to create an extra column to specify whether the value I split is either a minimum or maximum.
So this:
A B C D E
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5 min
a3 b3 c3 7 max
a4 b4 c4 2.5
Thanks.
One option would be to use sub to paste 'min' and 'max' into the 'D' column where - is found, and then use cSplit to split the 'D' column.
library(splitstackshape)
df$D <- sub('(\\d+) - (\\d+)', '\\1,min - \\2,max', df$D)
res <- cSplit(cSplit(df, 'D', ' - ', 'long'), 'D', ',')[is.na(D_2), D_2 := '']
setnames(res, 4:5, LETTERS[4:5])
res
# A B C D E
#1: a1 b1 c1 2.5
#2: a2 b2 c2 3.5
#3: a3 b3 c3 5.0 min
#4: a3 b3 c3 7.0 max
#5: a4 b4 c4 2.5
Here's a dplyr-ish way:
library(dplyr)
df %>%
  group_by(A, B, C) %>%
  do(data.frame(D = as.numeric(strsplit(as.character(.$D), " - ")[[1]]))) %>%
  mutate(E = if (n() == 2) c("min", "max") else "")
A B C D E
(fctr) (fctr) (fctr) (dbl) (chr)
1 a1 b1 c1 2.5
2 a2 b2 c2 3.5
3 a3 b3 c3 5.0 min
4 a3 b3 c3 7.0 max
5 a4 b4 c4 2.5
dplyr has a policy against expanding rows, as far as I can tell, so the ugly
do(data.frame(... .$ ...))
construct is required. If you are open to data.table, it's arguably simpler here:
library(data.table)
setDT(df)[, {
  D = as.numeric(strsplit(as.character(D), " - ")[[1]])
  list(D = D, E = if (length(D) == 2) c("min", "max") else "")
}, by = .(A, B, C)]
A B C D E
1: a1 b1 c1 2.5
2: a2 b2 c2 3.5
3: a3 b3 c3 5.0 min
4: a3 b3 c3 7.0 max
5: a4 b4 c4 2.5
We can use tidyr::separate_rows. I altered the input to include a negative value to make it more general:
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
  "A B C D
   a1 b1 c1 -2.5
   a2 b2 c2 3.5
   a3 b3 c3 '5 - 7'
   a4 b4 c4 2.5")
library(dplyr)
library(tidyr)
df %>%
  mutate(E = "", E = replace(E, grepl("[^^]-", D), "min - max")) %>%
  separate_rows(D, E, sep = "[^^]-", convert = TRUE)
#> A B C D E
#> 1 a1 b1 c1 -2.5
#> 2 a2 b2 c2 3.5
#> 3 a3 b3 c3 5.0 min
#> 4 a3 b3 c3 7.0 max
#> 5 a4 b4 c4 2.5
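Since this page is otherwise about pandas, a rough equivalent of separate_rows in pandas is str.split plus explode; a sketch, assuming pandas >= 1.3 for multi-column explode:

import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a4'],
                   'B': ['b1', 'b2', 'b3', 'b4'],
                   'C': ['c1', 'c2', 'c3', 'c4'],
                   'D': ['2.5', '3.5', '5 - 7', '2.5']})

df['D'] = df['D'].str.split(' - ')                   # lists of 1 or 2 values
df['E'] = df['D'].map(lambda v: ['min', 'max'] if len(v) == 2 else [''])
out = df.explode(['D', 'E']).reset_index(drop=True)  # one row per list element
out['D'] = out['D'].astype(float)
print(out)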

Generating binary variables in Pig

I am a newbie to the world of Pig and I need to implement the following scenario.
Problem:
Input to the Pig script: any arbitrary relation, say the table below.
A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3
We have to generate binary columns based on B and C, so my output will look something like this:
Output:
A B C B.b1 B.b2 C.c1 C.c2 C.c3
a1 b1 c1 1 0 1 0 0
a2 b2 c2 0 1 0 1 0
a1 b1 c3 1 0 0 0 1
Can someone let me know how to achieve this in Pig? I know this can easily be achieved with an R script, but my requirement is to do it via Pig.
Your help will be highly appreciated.
Can you try this?
Input:
a1 b1 c1
a2 b2 c2
a1 b1 c3
Pig script:
X = LOAD 'input' USING PigStorage() AS (A:chararray, B:chararray, C:chararray);
Y = FOREACH X GENERATE A, B, C,
    ((B == 'b1') ? 1 : 0) AS Bb1,
    ((B == 'b2') ? 1 : 0) AS Bb2,
    ((C == 'c1') ? 1 : 0) AS Cc1,
    ((C == 'c2') ? 1 : 0) AS Cc2,
    ((C == 'c3') ? 1 : 0) AS Cc3;
DUMP Y;
DUMP Y;
Output:
(a1,b1,c1,1,0,1,0,0)
(a2,b2,c2,0,1,0,1,0)
(a1,b1,c3,1,0,0,0,1)
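As an aside: the asker mentions this is easy in R, and the same is true of pandas, the topic of this page. A sketch for comparison using pd.get_dummies:

import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a1'],
                   'B': ['b1', 'b2', 'b1'],
                   'C': ['c1', 'c2', 'c3']})

# One indicator column per level of B and C, named B.b1, B.b2, C.c1, ...
dummies = pd.get_dummies(df[['B', 'C']], prefix_sep='.').astype(int)
print(pd.concat([df, dummies], axis=1))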

Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example
ID A B
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically, An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. Also, I tried individual groupbys for ID, A and ID, B. Maybe there is a way to pre-group by ID first and then do all the other variables? (There are many variables and I have very many rows!)
Also I tried individual groupby for ID, A and ID, B
I think this is a straightforward way to tackle it. As you suggest, you can group by each pair separately and then compute the size of the groups, using transform so you can easily add the results back to the original dataframe:
df['An'] = df.groupby(['ID', 'A'])['A'].transform('size')
df['Bn'] = df.groupby(['ID', 'B'])['B'].transform('size')
print(df)
ID A B An Bn
0 i1 a1 b1 2 1
1 i1 a1 b2 2 2
2 i1 a2 b2 1 2
3 i2 a1 b2 1 1
Of course, with lots of columns you could do:
for col in ['A', 'B']:
    df[col + 'n'] = df.groupby(['ID', col])[col].transform('size')
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A', 'B']:
    df[col + 'n'] = df.duplicated(['ID', col])
print(df)
ID A B An Bn
0 i1 a1 b1 False False
1 i1 a1 b2 True False
2 i1 a2 b2 False True
3 i2 a1 b2 False False
EDIT: improving performance for large data. I tried this on a large dataset (4 million rows), and it was significantly faster if I avoided transform and did something like the following (it is much less elegant):
for col in ['A', 'B']:
    x = df.groupby(['ID', col]).size()
    df.set_index(['ID', col], inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)
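A variant of the same idea that avoids mutating the index, using an explicit merge of the group sizes (a sketch; results should match the transform version above):

import pandas as pd

df = pd.DataFrame({'ID': ['i1', 'i1', 'i1', 'i2'],
                   'A':  ['a1', 'a1', 'a2', 'a1'],
                   'B':  ['b1', 'b2', 'b2', 'b2']})

for col in ['A', 'B']:
    sizes = df.groupby(['ID', col]).size().rename(col + 'n').reset_index()
    df = df.merge(sizes, on=['ID', col], how='left')  # attach per-group counts
print(df)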