Getting a KeyError when using groupby and sum on a DataFrame - pandas

I would like to use groupby and sum on a CSV file:
a b c d
1111 0.1 1 1
1111 0 1 0
2222 0.2 1 1
1111 0.2 2 1
2222 1 1
1111 0.3 2 0
3333 0.4 1 1
3333 0.5 2 1
1111 0.6 2 1
e: # if b < 0.2, group by column a and sum column c
f: # if b >= 0.2, group by column a and sum column c
g: # if d == 1 and b >= 0.2, g is the sum of c
h: # if d == 0 and b < 0.2, h is the sum of c
Expected output:
      e  f  g  h
1111  2  6  4  1
2222     1  1
3333     3  3
I tried:
df1 = df[(df['d'] == 1) & (df['b'] >= 0.2)]
df1.groupby('a')['c'].sum()
However, on a large file I get a KeyError for column a, raised from pandas.index.IndexEngine.get_loc and pandas.hashtable.PyObjectHashTable.get_item.

Maybe you can try a different approach:
First create the condition columns e to h, then use mul to fill these masks with the values of column c. Finally, use GroupBy.sum:
df['e'] = df['b'] < 0.2
df['f'] = df['b'] >= 0.2
df['g'] = (df['d'] == 1) & (df['b'] >= 0.2)
df['h'] = (df['d'] == 0) & (df['b'] < 0.2)
print (df)
a b c d e f g h
0 1111 0.1 1 1 True False False False
1 1111 0.0 1 0 True False False True
2 2222 0.2 1 1 False True True False
3 1111 0.2 2 1 False True True False
4 2222 NaN 1 1 False False False False
5 1111 0.3 2 0 False True False False
6 3333 0.4 1 1 False True True False
7 3333 0.5 2 1 False True True False
8 1111 0.6 2 1 False True True False
df.loc[:, ['e','f','g','h']] = df.loc[:, ['e','f','g','h']].mul(df.c, axis=0)
print (df)
a b c d e f g h
0 1111 0.1 1 1 1 0 0 0
1 1111 0.0 1 0 1 0 0 1
2 2222 0.2 1 1 0 1 1 0
3 1111 0.2 2 1 0 2 2 0
4 2222 NaN 1 1 0 0 0 0
5 1111 0.3 2 0 0 2 0 0
6 3333 0.4 1 1 0 1 1 0
7 3333 0.5 2 1 0 2 2 0
8 1111 0.6 2 1 0 2 2 0
df1 = df.groupby('a').sum()
print (df1[['e','f','g','h']])
e f g h
a
1111 2 6 4 1
2222 0 1 1 0
3333 0 3 3 0
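For reference, a minimal sketch that rebuilds the frame from the printed output above, so the answer can be run end to end. Note that the missing b value in the 2222 row is NaN, which compares False in every mask, so that row contributes 0 to each sum:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1111, 1111, 2222, 1111, 2222, 1111, 3333, 3333, 1111],
                   'b': [0.1, 0.0, 0.2, 0.2, np.nan, 0.3, 0.4, 0.5, 0.6],
                   'c': [1, 1, 1, 2, 1, 2, 1, 2, 2],
                   'd': [1, 0, 1, 1, 1, 0, 1, 1, 1]})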

Related

generate date feature column using pandas

I have a timeseries data frame that has columns like these:
Date temp_data holiday day
01.01.2000 10000 0 1
02.01.2000 0 1 2
03.01.2000 2000 0 3
..
..
..
26.01.2000 200 0 26
27.01.2000 0 1 27
28.01.2000 500 0 28
29.01.2000 0 1 29
30.01.2000 200 0 30
31.01.2000 0 1 31
01.02.2000 0 1 1
02.02.2000 2500 0 2
Here, holiday = 0 when there is data present (indicates a working day), and holiday = 1 when there is no data present (indicates a non-working day).
I am trying to extract three new columns from this data: second_last_working_day_of_month, third_last_working_day_of_month, and fourth_last_wday.
The output data frame should look like this:
Date temp_data holiday day secondlast_wd thirdlast_wd fourthlast_wd
01.01.2000 10000 0 1 1 0 0
02.01.2000 0 1 2 0 0 0
03.01.2000 2000 0 3 0 0 0
..
..
25.01.2000 345 0 25 0 0 1
26.01.2000 200 0 26 0 1 0
27.01.2000 0 1 27 0 0 0
28.01.2000 500 0 28 1 0 0
29.01.2000 0 1 29 0 0 0
30.01.2000 200 0 30 0 0 0
31.01.2000 0 1 31 0 0 0
01.02.2000 0 1 1 0 0 0
02.02.2000 2500 0 2 0 0 0
Can anyone help me with this?
Example
data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])
df
Date temp_data holiday day
0 26.01.2000 200 0 26
1 27.01.2000 0 1 27
2 28.01.2000 500 0 28
3 29.01.2000 0 1 29
4 30.01.2000 200 0 30
5 31.01.2000 0 1 31
6 26.02.2000 200 0 26
7 27.02.2000 0 0 27
8 28.02.2000 500 0 28
9 29.02.2000 0 1 29
Code
For example, make the secondlast_wd column (n=2):
n = 2
s = pd.to_datetime(df['Date'])
result = df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(n)
result
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
Turn result into the secondlast_wd column:
df.assign(secondlast_wd=result.astype('int'))
output:
Date temp_data holiday day secondlast_wd
0 26.01.2000 200 0 26 0
1 27.01.2000 0 1 27 0
2 28.01.2000 500 0 28 1
3 29.01.2000 0 1 29 0
4 30.01.2000 200 0 30 0
5 31.01.2000 0 1 31 0
6 26.02.2000 200 0 26 0
7 27.02.2000 0 0 27 1
8 28.02.2000 500 0 28 0
9 29.02.2000 0 1 29 0
You can change n to get the third, fourth, and so on; a generalized sketch follows.
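A hedged sketch of that generalization, building all three requested columns in one pass (the date format and column names are assumptions from the question; for data spanning multiple years, group by [s.dt.year, s.dt.month] instead of the month alone):
s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
# count workdays from the end of each month (reverse index)
rev_workdays = df.loc[::-1, 'holiday'].eq(0).groupby(s.dt.month).cumsum()
for n, col in [(2, 'secondlast_wd'), (3, 'thirdlast_wd'), (4, 'fourthlast_wd')]:
    df[col] = (df['holiday'].eq(0) & rev_workdays.eq(n)).astype(int)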
Update for comment
Check workdays (reverse index):
df.iloc[::-1, 2].eq(0)  # 2 is the positional index of 'holiday'; df.loc[::-1, 'holiday'] also works
9 False
8 True
7 True
6 True
5 False
4 True
3 False
2 True
1 False
0 True
Name: holiday, dtype: bool
Reverse cumsum by group (month): each workday adds 1 to the value above it, while each holiday keeps the same value as above (in reverse index order):
df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum()
9 0
8 1
7 2
6 3
5 0
4 1
3 1
2 2
1 2
0 3
Name: holiday, dtype: int64
Find rows where holiday == 0 and the cumulative count equals 2; that is secondlast_wd:
df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(2)
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
This operation returns the index in its original (not reversed) order.
Other Way
A more readable version would be:
s = pd.to_datetime(df['Date'])
idx1 = df[df['holiday'].eq(0)].groupby(s.dt.month, as_index=False).nth(-2).index
df.loc[idx1, 'lastsecondary_wd'] = 1
df['lastsecondary_wd'] = df['lastsecondary_wd'].fillna(0).astype('int')
Same result.
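The same generalization works for this approach too; a sketch, with the column names again assumed from the question:
s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
for n, col in [(2, 'secondlast_wd'), (3, 'thirdlast_wd'), (4, 'fourthlast_wd')]:
    # index of the n-th last working day in each month
    idx = df[df['holiday'].eq(0)].groupby(s.dt.month).nth(-n).index
    df[col] = 0
    df.loc[idx, col] = 1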

Pandas apply function by group returning multiple new columns

I am trying to apply a function to a column by group with the objective of creating 2 new columns, containing the returned values of the function for each group. Example as follows:
def testms(x):
    mu = np.sum(x)
    si = np.sum(x) / 2
    return mu, si
df = pd.concat([pd.DataFrame({'A' : [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]}), pd.DataFrame({'B' : np.random.rand(10)})],axis=1)
df
A B
0 1 0.696761
1 1 0.035178
2 1 0.468180
3 1 0.157818
4 1 0.281470
5 2 0.377689
6 2 0.336046
7 2 0.005879
8 2 0.747436
9 2 0.772405
desired_result =
A B mu si
0 1 0.696761 1.652595 0.826297
1 1 0.035178 1.652595 0.826297
2 1 0.468180 1.652595 0.826297
3 1 0.157818 1.652595 0.826297
4 1 0.281470 1.652595 0.826297
5 2 0.377689 2.997657 1.498829
6 2 0.336046 2.997657 1.498829
7 2 0.005879 2.997657 1.498829
8 2 0.747436 2.997657 1.498829
9 2 0.772405 2.997657 1.498829
I think I have found a solution but I am looking for something a bit more elegant and efficient:
x = df.groupby('A')['B'].apply(lambda x: pd.Series(testms(x),index=['mu','si']))
A
1 mu 1.652595
si 0.826297
2 mu 2.997657
si 1.498829
Name: B, dtype: float64
df.merge(x.drop(labels='mu',level=1),on='A',how='outer').merge(x.drop(labels='si',level=1),on='A',how='outer')
One idea is to change the function so it creates new columns filled with the mu and si values and returns x, i.e. the whole group:
def testms(x):
    mu = np.sum(x['B'])
    si = np.sum(x['B']) / 2
    x['mu'] = mu
    x['si'] = si
    return x
x = df.groupby('A').apply(testms)
print (x)
A B mu si
0 1 0.352297 3.590048 1.795024
1 1 0.860488 3.590048 1.795024
2 1 0.939260 3.590048 1.795024
3 1 0.988280 3.590048 1.795024
4 1 0.449723 3.590048 1.795024
5 2 0.125852 1.300524 0.650262
6 2 0.853474 1.300524 0.650262
7 2 0.000996 1.300524 0.650262
8 2 0.223886 1.300524 0.650262
9 2 0.096316 1.300524 0.650262
Your solution can be simplified with Series.unstack and DataFrame.join:
df1 = df.groupby('A')['B'].apply(lambda x: pd.Series(testms(x),index=['mu','si'])).unstack()
x = df.join(df1, on='A')
print (x)
A B mu si
0 1 0.085961 2.791346 1.395673
1 1 0.887589 2.791346 1.395673
2 1 0.685952 2.791346 1.395673
3 1 0.946613 2.791346 1.395673
4 1 0.185231 2.791346 1.395673
5 2 0.994415 3.173444 1.586722
6 2 0.159852 3.173444 1.586722
7 2 0.773711 3.173444 1.586722
8 2 0.867337 3.173444 1.586722
9 2 0.378128 3.173444 1.586722
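A hedged alternative that avoids apply entirely (assumes pandas >= 0.25 for named aggregation): compute both statistics once per group, then broadcast them back with join:
stats = df.groupby('A')['B'].agg(mu='sum', si=lambda x: x.sum() / 2)
out = df.join(stats, on='A')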

how to convert pandas dataframe to libsvm format?

I have a pandas data frame like below.
df
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 \
0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
1 0 1 1 1 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
7 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
[8 rows x 100 columns]
I have the target variable as an array, as below.
[1, -1, -1, 1, 1, -1, 1, 1]
How can I map this target variable to the data frame and convert it into libsvm format?
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.index.map[(equi)]
d = df[np.setdiff1d(df.columns,['indx','labels'])]
e = df.label
dump_svmlight_file(d,e,'D:/result/smvlight2.dat')
ERROR:
File "D:/spyder/april.py", line 54, in <module>
df["labels"] = df.index.map[(equi)]
TypeError: 'method' object is not subscriptable
When I use
df["labels"] = df.index.list(map[(equi)])
ERROR:
AttributeError: 'RangeIndex' object has no attribute 'list'
Please help me solve these errors.
I think you need to convert the index to_series and then call map:
df["labels"] = df.index.to_series().map(equi)
Or use rename on the index:
df["labels"] = df.rename(index=equi).index
All together. For the difference of columns, pandas has difference:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.rename(index=equi).index
e = df["labels"]
d = df[df.columns.difference(['indx','labels'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
Also, it seems the labels column is not necessary:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
e = df.rename(index=equi).index
d = df[df.columns.difference(['indx'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
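Since the target variable already exists as an array in the question, a sketch that skips the index mapping entirely (assuming the array is aligned with the rows of df):
import numpy as np
from sklearn.datasets import dump_svmlight_file

y = np.array([1, -1, -1, 1, 1, -1, 1, 1])
dump_svmlight_file(df, y, 'C:/result/smvlight2.dat')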

pandas most efficient way to compare dataframe and series

I have a dataframe of shape (n, p) and a series of length n.
I can compare them with:
for i in df.keys():
    df[i] > ts
Is there a way to do it in one line, something like df > ts? If yes, is it more efficient?
I think you need DataFrame.gt:
print (df.gt(s, axis=0))
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
s = pd.Series([1,2,3])
print (s)
0 1
1 2
2 3
dtype: int64
print (df.gt(s, axis=0))
A B C D E F
0 False True True False True True
1 False True True True True True
2 False True True True True False
If you need other functions for comparison, see the example after this list:
lt
gt
le
ge
ne
eq
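For example, less-than-or-equal with the same row-wise alignment:
print (df.le(s, axis=0))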

How to set (1) to max elements in pandas dataframe and (0) to everything else?

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
How do I set 1 for the row-wise max and 0 for the other elements?
I came up with:
>>> for i in range(len(df)):
...     df.loc[i][df.loc[i].idxmax(axis=1)] = 1
...     df.loc[i][df.loc[i] != 1] = 0
which generates:
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone have a better way of doing it? Maybe by getting rid of the for loop or applying a lambda?
Use max and check for equality using eq, then cast the boolean df to int using astype; this converts True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
Timings
In [24]:
# @Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# @Nader Hisham's method
%%timeit
def max_binary(df):
    binary = np.where(df == df.max(), 1, 0)
    return binary

df.apply(max_binary, axis=1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than @Raihan's method.
In [4]:
%%timeit
for i in range(len(df)):
    df.loc[i][df.loc[i].idxmax(axis=1)] = 1
    df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower.
import numpy as np

def max_binary(df):
    binary = np.where(df == df.max(), 1, 0)
    return binary

df.apply(max_binary, axis=1)
Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
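One caveat with ties: the eq-based mask flags every cell equal to the row maximum, while idxmax picks only the first. A hedged sketch (using pd.get_dummies, an approach not shown above) that flags only the first maximum per row:
first_max = (pd.get_dummies(df.idxmax(axis=1))
               .reindex(columns=df.columns, fill_value=0)
               .astype(int))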