Pandas custom file format

I have a huge Pandas DataFrame that I need to write out in a format that RankLib can understand. An example with a target, a query ID and 3 features looks like this:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
I have written my own function that iterates over the rows and writes them out like this:
data_file = open(filename, 'w')
for index, row in data.iterrows():
    line = str(row['score'])
    line += ' qid:' + str(row['srch_id'])
    counter = 0
    for feature in feature_columns:
        counter += 1
        line += ' ' + str(counter) + ':' + str(row[feature])
    data_file.write(line + '\n')
data_file.close()
Since I have about 200 features and 5 million rows, this is obviously very slow. Is there a better approach using pandas' own I/O?

You can do it this way:
Data:
In [155]: df
Out[155]:
f1 f2 f3 score srch_id
0 12 0.6 13 5 4
1 8 0.4 11 1 4
2 11 0.7 14 2 10
In [156]: df.dtypes
Out[156]:
f1 int64
f2 float64
f3 int64
score object
srch_id int64
dtype: object
Solution:
feature_columns = ['f1','f2','f3']
cols2id = {col: str(i+1) for i, col in enumerate(feature_columns)}

def f(x):
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'srch_id':
        return 'qid:' + x.astype(str)
    else:
        return x

(df.apply(lambda x: f(x))[['score','srch_id'] + feature_columns]
   .to_csv('d:/temp/out.csv', sep=' ', index=False, header=False)
)
out.csv:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
2 qid:10 1:11 2:0.7 3:14
cols2id helper dict:
In [158]: cols2id
Out[158]: {'f1': '1', 'f2': '2', 'f3': '3'}
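As an aside, the same lines can also be built fully vectorized by concatenating string columns and letting to_csv write them once; a minimal sketch reusing df, feature_columns and cols2id from above (the output path is illustrative):
out = df['score'].astype(str) + ' qid:' + df['srch_id'].astype(str)
for col in feature_columns:
    # append ' <feature id>:<value>' for every feature column
    out += ' ' + cols2id[col] + ':' + df[col].astype(str)
out.to_csv('out.txt', index=False, header=False)
For 5 million rows this avoids the per-row Python loop entirely; only the short loop over the ~200 feature columns remains.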

Related

Average on dataframe segments

I have a DataFrame that drops to zero after each cycle of operation (the cycles have random lengths). I want to calculate the average (or perform other operations) for each patch: for example, the average of [0.762, 0.766] alone, then [0.66, 1.37, 2.11, 2.29] alone, and so forth until the end of the DataFrame.
So I worked with this data:
random_value
0 0
1 0
2 1
3 2
4 3
5 0
6 4
7 4
8 0
9 1
There is probably a way better solution, but here is what I came up with:
def avg_function(df):
    avg_list = []
    value_list = list(df["random_value"])
    temp_list = []
    for i in range(len(value_list)):
        if value_list[i] == 0:
            if temp_list:
                avg_list.append(sum(temp_list) / len(temp_list))
                temp_list = []
        else:
            temp_list.append(value_list[i])
    if temp_list:  # for the last values
        avg_list.append(sum(temp_list) / len(temp_list))
    return avg_list

test_list = avg_function(df=df)
test_list
[Out]: [2.0, 4.0, 1.0]
Edit: since it was requested in the comments, here is a way to add the means back to the DataFrame. I don't know if there is a built-in pandas way to do that (there might be!), but I came up with this:
def add_mean(df, mean_list):
    temp_mean_list = []
    list_index = 0  # will be the index for the value of mean_list
    df["random_value_shifted"] = df["random_value"].shift(1).fillna(0)
    random_value = list(df["random_value"])
    random_value_shifted = list(df["random_value_shifted"])
    for i in range(df.shape[0]):
        if random_value[i] == 0 and random_value_shifted[i] == 0:
            temp_mean_list.append(0)
        elif random_value[i] == 0 and random_value_shifted[i] != 0:
            temp_mean_list.append(0)
            list_index += 1
        else:
            temp_mean_list.append(mean_list[list_index])
    df = df.drop(["random_value_shifted"], axis=1)
    df["mean"] = temp_mean_list
    return df

df = add_mean(df=df, mean_list=test_list)
Which gave me:
df
[Out]:
random_value mean
0 0 0
1 0 0
2 1 2
3 2 2
4 3 2
5 0 0
6 4 4
7 4 4
8 0 0
9 1 1
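As an aside, both the per-segment means and the write-back can be sketched with vectorized pandas, using a cumulative count of the zero rows as a segment id (assuming, as above, that zeros delimit the segments):
mask = df["random_value"].eq(0)        # zero rows delimit segments
groups = mask.cumsum()                 # same id for every row of one segment
means = df.loc[~mask, "random_value"].groupby(groups[~mask]).mean()
df["mean"] = groups.map(means).where(~mask, 0)  # 0 on the zero rows
On the sample data this reproduces [2.0, 4.0, 1.0] and the mean column shown above.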

Converting a string to number in jupyter

Here is my code:
def value_and_wage_conversion(value):
    if isinstance(value, str):
        if 'M' in out:
            out = float(out.replace('M', ''))*1000000
        elif 'K' in value:
            out = float(out.replace('K', ''))*1000
    return float(out)

fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))
Here is the error message:
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input> in <module>
      7     return float(out)
      8
----> 9 fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
     10 fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))

c:\users\brain\appdata\local\programs\python\python39\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4136             else:
   4137                 values = self.astype(object)._values
-> 4138                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4139
   4140         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input> in <lambda>(x)
      7     return float(out)
      8
----> 9 fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
     10 fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))

<ipython-input> in value_and_wage_conversion(value)
      1 def value_and_wage_conversion(value):
      2     if isinstance(value, str):
----> 3         if 'M' in out:
      4             out = float(out.replace('M', ''))*1000000
      5         elif 'K' in value:

UnboundLocalError: local variable 'out' referenced before assignment
You were almost there, but you need to fix your function: out is referenced before it is ever assigned, so use the value argument consistently instead.
For example
import numpy as np
import pandas as pd

# generate a random sample
values = ['10M', '10K', 10.5, '200M', '200K', 200]
size = 100
np.random.seed(1)
df = pd.DataFrame({
    'Value': np.random.choice(values, size),
    'Wage': np.random.choice(values, size),
})
print(df)
Value Wage
0 200 200
1 200M 200M
2 200K 200
3 10M 10M
4 10K 200M
.. ... ...
95 200K 200
96 200 200M
97 10.5 200K
98 200K 10.5
99 200M 10M
[100 rows x 2 columns]
Define the function and apply it:
def value_and_wage_conversion(value):
    if isinstance(value, str):
        if 'M' in value:
            value = float(value.replace('M', ''))*1000000
        elif 'K' in value:
            value = float(value.replace('K', ''))*1000
    return float(value)

df['Value'] = df['Value'].apply(lambda x: value_and_wage_conversion(x))
df['Wage'] = df['Wage'].apply(lambda x: value_and_wage_conversion(x))
print(df)
Value Wage
0 200.0 200.0
1 200000000.0 200000000.0
2 200000.0 200.0
3 10000000.0 10000000.0
4 10000.0 200000000.0
.. ... ...
95 200000.0 200.0
96 200.0 200000000.0
97 10.5 200000.0
98 200000.0 10.5
99 200000000.0 10000000.0
[100 rows x 2 columns]
and check:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Value 100 non-null float64
1 Wage 100 non-null float64
dtypes: float64(2)
memory usage: 1.7 KB
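As an aside, the conversion can also be done without apply; a vectorized sketch, assuming (as in the sample) that values are either plain numbers or strings ending in 'M' or 'K':
def convert_series(s: pd.Series) -> pd.Series:
    s = s.astype(str)
    # scale factor chosen by suffix: 'M' -> 1e6, 'K' -> 1e3, else 1
    mult = np.where(s.str.endswith('M'), 1e6,
           np.where(s.str.endswith('K'), 1e3, 1.0))
    return s.str.rstrip('MK').astype(float) * mult

df['Value'] = convert_series(df['Value'])
df['Wage'] = convert_series(df['Wage'])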

How to replace pd.NamedAgg to a code compliant with pandas 0.24.2?

Hello, I am obliged to downgrade my pandas version to 0.24.2.
As a result, pd.NamedAgg is no longer recognized.
import pandas as pd
import numpy as np

agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
Can you please help me change my code to make it compliant with version 0.24.2?
Thank you very much.
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
Sample:
df = pd.DataFrame({
    'A': list('a')*6,
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7]*6,
    'Foo': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Named aggregation was only added in pandas 0.25.0, so it is unavailable on 0.24.2. Because only the single column Foo is being processed, select Foo after the groupby and pass tuples of (new column name, aggregate function):
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Another idea is to pass a dictionary of lists of aggregate functions:
agg_df = df.groupby(agg_cols).agg({'Foo':['max', 'min']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)
A B C max_Foo min_Foo
0 a 4 7 5 0
1 a 5 7 7 1
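For what it's worth, the dictionary spelling also extends to several columns on 0.24.2; a small sketch that additionally aggregates the sample df's E column (the mean_E name is just illustrative):
agg_df = df.groupby(agg_cols).agg({'Foo': ['max', 'min'], 'E': ['mean']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]  # e.g. ('Foo','max') -> 'max_Foo'
agg_df = agg_df.reset_index()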

Dual sort based on condition within a groupby

[EDIT] Changed df size to 1k and provided piecemeal code for expected result.
Have the following df:
import random
import numpy as np
import pandas as pd

random.seed(1234)
sz = 1000
typ = ['a', 'b', 'c']
sub_typ = ['s1', 's2', 's3', 's4']
ifs = ['A', 'D']
col_sort = np.random.randint(0, 10, size=sz)
col_val = np.random.randint(100, 1000, size=sz)
df = pd.DataFrame({'typ': random.choices(typ, k=sz),
                   'sub_typ': random.choices(sub_typ, k=sz),
                   'col_if': random.choices(ifs, k=sz),
                   'col_sort': col_sort,
                   'value': col_val})
I would like to sort within a groupby of the [typ] and [sub_typ] fields, such that the [col_sort] field is sorted in ascending order if [col_if] == 'A' and in descending order if [col_if] == 'D', and then pick up the first 3 values of the sorted dataframe, in one line of code.
Expected result is like df_result below:
df_A = df[df.col_if == 'A']
df_D = df[df.col_if == 'D']
df_A_sorted_3 = (df_A.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=True))
                     .groupby(['typ', 'sub_typ', 'col_sort']).head(3))
df_D_sorted_3 = (df_D.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=False))
                     .groupby(['typ', 'sub_typ', 'col_sort']).head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
Tried:
df.groupby(['typ', 'sub_typ']).apply(
    lambda x: x.sort_values('col_sort', ascending=True) if x.col_if == 'A'
    else x.sort_values('col_sort', ascending=False)
).groupby(['typ', 'sub_typ', 'col_sort']).head(3)
...but it gives the error:
ValueError: The truth value of a Series is ambiguous.
Sorting within groups is the same as sorting by multiple columns, but to get identical output a stable sort (kind='mergesort') is necessary.
So to improve performance I suggest NOT sorting per group inside the groupby:
np.random.seed(1234)
sz = 1000
typ = ['a', 'b', 'c']
sub_typ = ['s1', 's2', 's3', 's4']
ifs = ['A', 'D']
col_sort = np.random.randint(0, 10, size=sz)
col_val = np.random.randint(100, 1000, size=sz)
df = pd.DataFrame({'typ': np.random.choice(typ, sz),
                   'sub_typ': np.random.choice(sub_typ, sz),
                   'col_if': np.random.choice(ifs, sz),
                   'col_sort': col_sort,
                   'value': col_val})
# print (df)
df_A = df[df.col_if == 'A']
df_D = df[df.col_if == 'D']
df_A_sorted_3 = (df_A.sort_values(['typ', 'sub_typ', 'col_sort'])
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_D_sorted_3 = (df_D.sort_values(['typ', 'sub_typ', 'col_sort'], ascending=[True, True, False])
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
print (df_result)
typ sub_typ col_if col_sort value
0 a s1 A 0 709
1 a s1 A 0 710
2 a s1 A 0 801
3 a s1 A 1 542
4 a s1 A 1 557
.. .. ... ... ... ...
646 c s4 D 1 555
647 c s4 D 1 233
648 c s4 D 0 501
649 c s4 D 0 436
650 c s4 D 0 695
[651 rows x 5 columns]
Compare outputs:
df_A_sorted_3 = (df_A.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=True, kind='mergesort'))
                     .groupby(['typ', 'sub_typ', 'col_sort']).head(3))
df_D_sorted_3 = (df_D.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=False, kind='mergesort'))
                     .groupby(['typ', 'sub_typ', 'col_sort']).head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
print (df_result)
typ sub_typ col_if col_sort value
0 a s1 A 0 709
1 a s1 A 0 710
2 a s1 A 0 801
3 a s1 A 1 542
4 a s1 A 1 557
.. .. ... ... ... ...
646 c s4 D 1 555
647 c s4 D 1 233
648 c s4 D 0 501
649 c s4 D 0 436
650 c s4 D 0 695
[651 rows x 5 columns]
EDIT: Possible, but slow:
def f(x):
    a = x[x.col_if == 'A'].sort_values('col_sort', ascending=True, kind='mergesort')
    d = x[x.col_if == 'D'].sort_values('col_sort', ascending=False, kind='mergesort')
    return pd.concat([a, d], sort=False)

df_result = (df.groupby(['typ', 'sub_typ', 'col_if'], as_index=False, group_keys=False)
               .apply(f)
               .groupby(['typ', 'sub_typ', 'col_sort', 'col_if'])
               .head(3))
print (df_result)
typ sub_typ col_if col_sort value
242 a s1 A 0 709
535 a s1 A 0 710
589 a s1 A 0 801
111 a s1 A 1 542
209 a s1 A 1 557
.. .. ... ... ... ...
39 c s4 D 1 555
211 c s4 D 1 233
13 c s4 D 0 501
614 c s4 D 0 436
658 c s4 D 0 695
[651 rows x 5 columns]
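As an aside, the A/D switch can also be folded into one stable sort by negating col_sort for the 'D' rows; a sketch with a hypothetical helper column sort_key:
tmp = df.assign(sort_key=np.where(df['col_if'].eq('D'),
                                  -df['col_sort'], df['col_sort']))
df_result = (tmp.sort_values(['typ', 'sub_typ', 'col_if', 'sort_key'],
                             kind='mergesort')
                .groupby(['typ', 'sub_typ', 'col_sort', 'col_if'])
                .head(3)
                .drop(columns='sort_key')
                .reset_index(drop=True))
Ascending sort_key means ascending col_sort for 'A' rows and descending for 'D' rows, so the data is passed over only once.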
You wrote that col_if should act as a "switch" for the sort order.
But note that each group (at least for your seeding of random) contains
both A and D in the col_if column, so your requirement is ambiguous.
One possible solution is to perform a "majority vote" in each group,
i.e. the sort order in a particular group is ascending if there are
at least as many A values as D values. Note that I arbitrarily chose the
ascending order in the "equal" case; maybe you should take the other option.
A doubtful point in your requirements (and hence your code) is that you
put .head(3) after the group processing.
This way you get the first 3 rows of the overall result only.
Maybe you want the 3 initial rows from each group?
In that case head(3) should be inside the lambda function (as I wrote
below).
So change your code to:
df.groupby(['typ', 'sub_typ']).apply(
    lambda x: x.sort_values(
        'col_sort',
        ascending=(x.col_if.eq('A').sum() >= x.col_if.eq('D').sum())
    ).head(3))
As you can see, the sort order can be expressed as a single boolean expression for ascending, instead of two similar expressions differing only in the ascending parameter.

'float' object has no attribute 'split'

I have a pandas DataFrame with a column of float numbers. I tried to split each item in the column on the dot '.'; then I want to add the first parts to the second parts. I don't know why this sample code is not working.
data=
0 28.47000
1 28.45000
2 28.16000
3 28.29000
4 28.38000
5 28.49000
6 28.21000
7 29.03000
8 29.11000
9 28.11000
new_array = []
df = list(data)
for i in np.arange(len(data)):
    df1 = df[i].split('.')
    df2 = df1[0] + df[1]/60
    new_array = np.append(new_array, df2)
Use numpy.modf with the DataFrame constructor:
arr = np.modf(data.values)
df = pd.DataFrame({'a':data, 'b':arr[1] + arr[0] / 60})
print (df)
a b
0 28.47 28.007833
1 28.45 28.007500
2 28.16 28.002667
3 28.29 28.004833
4 28.38 28.006333
5 28.49 28.008167
6 28.21 28.003500
7 29.03 29.000500
8 29.11 29.001833
9 28.11 28.001833
Detail:
arr = np.modf(data.values)
print(arr)
(array([ 0.47, 0.45, 0.16, 0.29, 0.38, 0.49, 0.21, 0.03, 0.11, 0.11]),
array([ 28., 28., 28., 28., 28., 28., 28., 29., 29., 28.]))
print(arr[0] / 60)
[ 0.00783333 0.0075 0.00266667 0.00483333 0.00633333 0.00816667
0.0035 0.0005 0.00183333 0.00183333]
EDIT: to treat the fractional digits as whole minutes instead (.47 → 47/60), scale by 100/60 = 5/3:
df = pd.DataFrame({'a': data, 'b': arr[1] + arr[0]*5/3})
print (df)
a b
0 28.47 28.783333
1 28.45 28.750000
2 28.16 28.266667
3 28.29 28.483333
4 28.38 28.633333
5 28.49 28.816667
6 28.21 28.350000
7 29.03 29.050000
8 29.11 29.183333
9 28.11 28.183333
Your data types are floats, not strings, and so cannot be .split() (that is a string method). Instead you can use math.modf to 'split' a float into its fractional and integer parts:
https://docs.python.org/3.6/library/math.html
import math
def process(x:float, divisor:int=60) -> float:
"""
Convert a float to its constituent parts. Divide the fractional part by the divisor, and then recombine creating a 'scaled fractional' part,
"""
b, a = math.modf(x)
c = a + b/divisor
return c
df['data'].apply(process)
Out[17]:
0 28.007833
1 28.007500
2 28.002667
3 28.004833
4 28.006333
5 28.008167
6 28.003500
7 29.000500
8 29.001833
9 28.001833
Name: data, dtype: float64
Your other option is to convert them to strings, split, convert back to ints and floats, do some maths, and then combine the floats. Personally, I'd rather keep the object as it is.
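For completeness, a sketch of that string route, assuming data is the float Series above (it reproduces the divide-by-60 result):
parts = data.astype(str).str.split('.', expand=True)   # integer / fractional digits
result = parts[0].astype(float) + ('0.' + parts[1]).astype(float) / 60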