[EDIT] Changed df size to 1k and provided piecemeal code for expected result.
Have the following df:
import random
import numpy as np
import pandas as pd

random.seed(1234)
np.random.seed(1234)  # np.random.randint below also needs seeding for reproducibility
sz = 1000
typ = ['a', 'b', 'c']
sub_typ = ['s1', 's2', 's3', 's4']
ifs = ['A', 'D']
col_sort = np.random.randint(0, 10, size=sz)
col_val = np.random.randint(100, 1000, size=sz)
df = pd.DataFrame({'typ': random.choices(typ, k=sz),
                   'sub_typ': random.choices(sub_typ, k=sz),
                   'col_if': random.choices(ifs, k=sz),
                   'col_sort': col_sort,
                   'value': col_val})
Would like to sort within a groupby of the [typ] and [sub_typ] fields, such that [col_sort] is sorted in ascending order if [col_if] == 'A' and in descending order if [col_if] == 'D', and then pick up the first 3 values of the sorted dataframe, in one line of code.
Expected result is like df_result below:
df_A = df[df.col_if == 'A']
df_D = df[df.col_if == 'D']
df_A_sorted_3 = (df_A.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=True))
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_D_sorted_3 = (df_D.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=False))
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
Tried:
df.groupby(['typ', 'sub_typ']).apply(
    lambda x: x.sort_values('col_sort', ascending=True)
    if x.col_if == 'A'
    else x.sort_values('col_sort', ascending=False)
).groupby(['typ', 'sub_typ', 'col_sort']).head(3)
...but it gives the error:
ValueError: The truth value of a Series is ambiguous.
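For reference, the lambda receives a whole group, so x.col_if == 'A' evaluates to a boolean Series, and Python's if cannot reduce a Series to a single True/False. A minimal sketch inspecting one group by hand shows this:
# x.col_if == 'A' is a Series of booleans, not a scalar,
# so `if x.col_if == 'A'` raises the ValueError above
x = df[(df.typ == 'a') & (df.sub_typ == 's1')]
print(x.col_if == 'A')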
Sorting per group is the same as sorting by multiple columns, but if you need identical output, a stable sort (kind='mergesort') is necessary.
So to improve performance I suggest NOT sorting per group inside groupby:
import numpy as np
import pandas as pd

np.random.seed(1234)
sz = 1000
typ = ['a', 'b', 'c']
sub_typ = ['s1', 's2', 's3', 's4']
ifs = ['A', 'D']
col_sort = np.random.randint(0, 10, size=sz)
col_val = np.random.randint(100, 1000, size=sz)
df = pd.DataFrame({'typ': np.random.choice(typ, sz),
                   'sub_typ': np.random.choice(sub_typ, sz),
                   'col_if': np.random.choice(ifs, sz),
                   'col_sort': col_sort,
                   'value': col_val})
# print (df)
df_A = df[df.col_if == 'A']
df_D = df[df.col_if == 'D']
df_A_sorted_3 = (df_A.sort_values(['typ', 'sub_typ','col_sort'])
.groupby(['typ', 'sub_typ', 'col_sort'])
.head(3))
df_D_sorted_3 = (df_D.sort_values(['typ', 'sub_typ','col_sort'], ascending=[True, True, False])
.groupby(['typ', 'sub_typ', 'col_sort'])
.head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
print (df_result)
typ sub_typ col_if col_sort value
0 a s1 A 0 709
1 a s1 A 0 710
2 a s1 A 0 801
3 a s1 A 1 542
4 a s1 A 1 557
.. .. ... ... ... ...
646 c s4 D 1 555
647 c s4 D 1 233
648 c s4 D 0 501
649 c s4 D 0 436
650 c s4 D 0 695
[651 rows x 5 columns]
Compare outputs:
df_A_sorted_3 = (df_A.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=True, kind='mergesort'))
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_D_sorted_3 = (df_D.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=False, kind='mergesort'))
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
print (df_result)
typ sub_typ col_if col_sort value
0 a s1 A 0 709
1 a s1 A 0 710
2 a s1 A 0 801
3 a s1 A 1 542
4 a s1 A 1 557
.. .. ... ... ... ...
646 c s4 D 1 555
647 c s4 D 1 233
648 c s4 D 0 501
649 c s4 D 0 436
650 c s4 D 0 695
[651 rows x 5 columns]
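Both approaches print the same frame; to confirm that programmatically, a small sketch (storing the two results under the hypothetical names result_fast and result_slow):
import pandas.testing as pdt

# result_fast: from the sort-first approach; result_slow: from groupby-apply
pdt.assert_frame_equal(result_fast.reset_index(drop=True),
                       result_slow.reset_index(drop=True))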
EDIT: Possible, but slow:
def f(x):
    a = x[x.col_if == 'A'].sort_values('col_sort', ascending=True, kind='mergesort')
    d = x[x.col_if == 'D'].sort_values('col_sort', ascending=False, kind='mergesort')
    return pd.concat([a, d], sort=False)

df_result = (df.groupby(['typ', 'sub_typ', 'col_if'], as_index=False, group_keys=False)
               .apply(f)
               .groupby(['typ', 'sub_typ', 'col_sort', 'col_if'])
               .head(3))
print (df_result)
typ sub_typ col_if col_sort value
242 a s1 A 0 709
535 a s1 A 0 710
589 a s1 A 0 801
111 a s1 A 1 542
209 a s1 A 1 557
.. .. ... ... ... ...
39 c s4 D 1 555
211 c s4 D 1 233
13 c s4 D 0 501
614 c s4 D 0 436
658 c s4 D 0 695
[651 rows x 5 columns]
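If you want to quantify "slow" yourself, a minimal timing sketch with the stdlib timeit (the wrapper names fast_approach and slow_approach are hypothetical; df, pd and f are the objects defined above):
import timeit

def fast_approach():
    # sort the whole frame once per direction, then take per-group heads
    a = (df[df.col_if == 'A'].sort_values(['typ', 'sub_typ', 'col_sort'])
         .groupby(['typ', 'sub_typ', 'col_sort']).head(3))
    d = (df[df.col_if == 'D'].sort_values(['typ', 'sub_typ', 'col_sort'],
                                          ascending=[True, True, False])
         .groupby(['typ', 'sub_typ', 'col_sort']).head(3))
    return pd.concat([a, d]).reset_index(drop=True)

def slow_approach():
    # per-group apply with a Python-level function
    return (df.groupby(['typ', 'sub_typ', 'col_if'], as_index=False, group_keys=False)
              .apply(f)
              .groupby(['typ', 'sub_typ', 'col_sort', 'col_if'])
              .head(3))

print(timeit.timeit(fast_approach, number=10))
print(timeit.timeit(slow_approach, number=10))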
You wrote that col_if should act as a "switch" for the sort order.
But note that each group (at least for your random seeding) contains
both A and D in the col_if column, so your requirement is ambiguous.
One possible solution is to perform a "majority vote" in each group,
i.e. the sort order in a particular group is ascending if there are
at least as many A values as D values. Note that I arbitrarily chose the
ascending order in the "equal" case; maybe you should take the other option.
A doubtful point in your requirements (and hence the code) is that you
put .head(3) after the group processing.
This way you get the first 3 rows from the first group only.
Maybe you want the 3 initial rows from each group?
In that case head(3) should go inside the lambda function (as I wrote
below).
So change your code to:
df.groupby(['typ', 'sub_typ']).apply(
    lambda x: x.sort_values(
        'col_sort',
        ascending=(x.col_if.eq('A').sum() >= x.col_if.eq('D').sum())
    ).head(3))
As you can see, the sort order can be expressed as a boolean expression for
ascending, instead of two almost identical branches differing only in the
ascending parameter.
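To see how that boolean is evaluated, a small sketch inspecting a single group by hand (group keys picked arbitrarily from the sample data):
# count A's and D's in one (typ, sub_typ) group; True means sort ascending
g = df[(df.typ == 'a') & (df.sub_typ == 's1')]
asc = g.col_if.eq('A').sum() >= g.col_if.eq('D').sum()
print(asc)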
Related
I want to add relationships to the 'relations' column based on rel_list. Specifically, for each tuple, e.g. ('a', 'b'), I want to replace the 'relations' value '' with 'b' in the first row, but with no duplicates, meaning that for the 2nd row I don't replace '' with 'a', since that pair is considered a duplicate. The following code doesn't work fully correctly:
import pandas as pd

data = {
    "names": ['a', 'b', 'c', 'd'],
    "ages": [50, 40, 45, 20],
    "relations": ['', '', '', '']
}
rel_list = [('a', 'b'), ('a', 'c'), ('c', 'd')]

df = pd.DataFrame(data)
for rel_tuple in rel_list:
    head = rel_tuple[0]
    tail = rel_tuple[1]
    df.loc[df.names == head, 'relations'] = tail
print(df)
The current result of df is:
names ages relations
0 a 50 c
1 b 40
2 c 45 d
3 d 20
However, the correct one is:
names ages relations
0 a 50 b
0 a 50 c
1 b 40
2 c 45 d
3 d 20
There are new rows that need to be added, like the 2nd row in the expected output above. How can I do that?
You can craft a dataframe and merge:
(df.drop('relations', axis=1)
.merge(pd.DataFrame(rel_list, columns=['names', 'relations']),
on='names',
how='outer'
)
# .fillna('') # uncomment to replace NaN with empty string
)
Output:
names ages relations
0 a 50 b
1 a 50 c
2 b 40 NaN
3 c 45 d
4 d 20 NaN
Instead of updating df you can create a new one and add relations row by row:
import pandas as pd

data = {
    "names": ['a', 'b', 'c', 'd'],
    "ages": [50, 40, 45, 20],
    "relations": ['', '', '', '']
}
rel_list = [('a', 'b'), ('a', 'c'), ('c', 'd')]

df = pd.DataFrame(data)
new_df = pd.DataFrame(data)
new_df.loc[:, 'relations'] = ''
for head, tail in rel_list:
    new_row = df[df.names == head].copy()  # copy to avoid SettingWithCopyWarning
    new_row.loc[:, 'relations'] = tail
    new_df = pd.concat([new_df, new_row])  # DataFrame.append was removed in pandas 2.0
print(new_df)
Output:
names ages relations
0 a 50
1 b 40
2 c 45
3 d 20
0 a 50 b
0 a 50 c
2 c 45 d
Then, if needed, you can finally delete all rows without a value in 'relations':
new_df = new_df[new_df['relations']!='']
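If you also want to keep the names that never received a relation, matching the expected output in the question, a follow-up sketch (variable names are mine):
# re-attach original rows whose name got no relation, then restore row order
unmatched = df[~df['names'].isin(new_df['names'])]
result = pd.concat([new_df, unmatched]).sort_index()
print(result)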
Here is my code:
def value_and_wage_conversion(value):
    if isinstance(value, str):
        if 'M' in out:
            out = float(out.replace('M', ''))*1000000
        elif 'K' in value:
            out = float(out.replace('K', ''))*1000
    return float(out)

fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))
Here is the error message:
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
----> 9 fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
     10 fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))

c:\users\brain\appdata\local\programs\python\python39\lib\site-packages\pandas\core\series.py
in apply(self, func, convert_dtype, args, **kwds)
   4136         else:
   4137             values = self.astype(object)._values
-> 4138             mapped = lib.map_infer(values, f, convert=convert_dtype)

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

in value_and_wage_conversion(value)
      1 def value_and_wage_conversion(value):
      2     if isinstance(value, str):
----> 3         if 'M' in out:
      4             out = float(out.replace('M', ''))*1000000
      5         elif 'K' in value:

UnboundLocalError: local variable 'out' referenced before assignment
You were almost there, but you need to fix your function: it references out before anything is assigned to it, while the parameter is actually named value.
For example:
import numpy as np
import pandas as pd

# generate a random sample
values = ['10M', '10K', 10.5, '200M', '200K', 200]
size = 100
np.random.seed(1)
df = pd.DataFrame({
    'Value': np.random.choice(values, size),
    'Wage': np.random.choice(values, size),
})
print(df)
Value Wage
0 200 200
1 200M 200M
2 200K 200
3 10M 10M
4 10K 200M
.. ... ...
95 200K 200
96 200 200M
97 10.5 200K
98 200K 10.5
99 200M 10M
[100 rows x 2 columns]
Define the fixed function and apply it:
def value_and_wage_conversion(value):
    if isinstance(value, str):
        if 'M' in value:
            value = float(value.replace('M', ''))*1000000
        elif 'K' in value:
            value = float(value.replace('K', ''))*1000
    return float(value)

df['Value'] = df['Value'].apply(lambda x: value_and_wage_conversion(x))
df['Wage'] = df['Wage'].apply(lambda x: value_and_wage_conversion(x))
print(df)
Value Wage
0 200.0 200.0
1 200000000.0 200000000.0
2 200000.0 200.0
3 10000000.0 10000000.0
4 10000.0 200000000.0
.. ... ...
95 200000.0 200.0
96 200.0 200000000.0
97 10.5 200000.0
98 200000.0 10.5
99 200000000.0 10000000.0
[100 rows x 2 columns]
and check
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Value 100 non-null float64
1 Wage 100 non-null float64
dtypes: float64(2)
memory usage: 1.7 KB
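As a side note, a vectorized alternative sketch (my own suggestion, not the OP's approach), operating on the original unconverted column and assuming every string value either ends in 'M'/'K' or is a plain number; raw is a hypothetical name for that column:
# map the trailing letter to a multiplier; plain numbers get multiplier 1
s = raw.astype(str)
mult = s.str[-1].map({'M': 1e6, 'K': 1e3}).fillna(1)
converted = pd.to_numeric(s.str.rstrip('MK')) * mult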
Hello, I am obliged to downgrade my pandas version to 0.24.2.
As a result, the function pd.NamedAgg is not recognized anymore.
import pandas as pd
import numpy as np

agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
Can you please help me change my code to make it compliant with version 0.24.2?
Thank you.
Use a list of (name, function) tuples instead:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
Sample:
df = pd.DataFrame({
    'A': list('a')*6,
    'B': [4,5,4,5,5,4],
    'C': [7]*6,
    'Foo': [1,3,5,7,1,0],
    'E': [5,3,6,9,2,4],
    'F': list('aaabbb')
})
agg_cols = ['A', 'B', 'C']

agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print (agg_df)
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Because there is only one column Foo to process, select the column Foo after groupby and pass tuples of new column names and aggregate functions:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Another idea is to pass a dictionary of lists of aggregate functions:
agg_df = df.groupby(agg_cols).agg({'Foo':['max', 'min']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
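The rename on the second line is needed because the dict form produces a MultiIndex on the columns; a small sketch showing the intermediate state:
tmp = df.groupby(agg_cols).agg({'Foo': ['max', 'min']})
print(tmp.columns.tolist())
# [('Foo', 'max'), ('Foo', 'min')] -> flattened to ['max_foo', 'min_foo']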
Why do the following assignments behave differently?
df.loc[rows, [col]] = ...
df.loc[rows, col] = ...
For example:
r = pd.DataFrame({"response": [1, 1, 1]}, index=[1, 2, 3])
df = pd.DataFrame({"x": [999, 99, 9]}, index=[3, 4, 5])
df = pd.merge(df, r, how="left", left_index=True, right_index=True)
df.loc[df["response"].isnull(), "response"] = 0
print(df)
x response
3 999 0.0
4 99 0.0
5 9 0.0
but
df.loc[df["response"].isnull(), ["response"]] = 0
print(df)
x response
3 999 1.0
4 99 0.0
5 9 0.0
Why should I expect the first to behave differently from the second?
df.loc[df["response"].isnull(), ["response"]]
returns a DataFrame, so if you want to assign something to it it must be aligned by both index and columns
Demo:
In [79]: df.loc[df["response"].isnull(), ["response"]] = \
pd.DataFrame([11,12], columns=['response'], index=[4,5])
In [80]: df
Out[80]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
alternatively you can assign an array/matrix of the same shape:
In [83]: df.loc[df["response"].isnull(), ["response"]] = [11, 12]
In [84]: df
Out[84]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
I'd also consider using the fillna() method:
In [88]: df.response = df.response.fillna(0)
In [89]: df
Out[89]:
x response
3 999 1.0
4 99 0.0
5 9 0.0
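fillna also accepts a dict mapping column names to fill values, so the same fix can stay a single DataFrame-level call; a small equivalent sketch:
# fill NaN only in the 'response' column, leave other columns untouched
df = df.fillna({'response': 0})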
I have a huge Pandas DataFrame that I need to write out in a format that RankLib can understand. An example with a target, a query ID and 3 features looks like this:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
I have written my own function that iterates over the rows and writes them away like this:
data_file = open(filename, 'w')
for index, row in data.iterrows():
    line = str(row['score'])
    line += ' qid:' + str(row['srch_id'])
    counter = 0
    for feature in feature_columns:
        counter += 1
        line += ' ' + str(counter) + ':' + str(row[feature])
    data_file.write(line + '\n')
data_file.close()
Since I have about 200 features and 5M rows, this is obviously very slow. Is there a better approach using the I/O of pandas itself?
You can do it this way (the per-column apply below performs one vectorized string operation per column, instead of looping over millions of rows in Python):
Data:
In [155]: df
Out[155]:
f1 f2 f3 score srch_id
0 12 0.6 13 5 4
1 8 0.4 11 1 4
2 11 0.7 14 2 10
In [156]: df.dtypes
Out[156]:
f1 int64
f2 float64
f3 int64
score object
srch_id int64
dtype: object
Solution:
feature_columns = ['f1', 'f2', 'f3']
cols2id = {col: str(i+1) for i, col in enumerate(feature_columns)}

def f(x):
    # x is a whole column; prefix feature columns with "<id>:" and srch_id with "qid:"
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'srch_id':
        return 'qid:' + x.astype(str)
    else:
        return x

(df.apply(lambda x: f(x))[['score', 'srch_id'] + feature_columns]
   .to_csv('d:/temp/out.csv', sep=' ', index=False, header=None)
)
out.csv:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
2 qid:10 1:11 2:0.7 3:14
cols2id helper dict:
In [158]: cols2id
Out[158]: {'f1': '1', 'f2': '2', 'f3': '3'}
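To see what to_csv receives, a quick sketch printing the intermediate frame after the per-column transform (output shown approximately):
print(df.apply(f)[['score', 'srch_id'] + feature_columns])
#   score srch_id    f1     f2    f3
# 0     5   qid:4  1:12  2:0.6  3:13
# 1     1   qid:4   1:8  2:0.4  3:11
# 2     2  qid:10  1:11  2:0.7  3:14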