Converting a string to a number in Jupyter - pandas

Here is my code:
def value_and_wage_conversion(value):
    if isinstance(value, str):
        if 'M' in out:
            out = float(out.replace('M', ''))*1000000
        elif 'K' in value:
            out = float(out.replace('K', ''))*1000
    return float(out)

fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))
Here is the error message:
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input> in <module>
      7     return float(out)
      8 
----> 9 fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
     10 fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))

c:\users\brain\appdata\local\programs\python\python39\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4136             else:
   4137                 values = self.astype(object)._values
-> 4138             mapped = lib.map_infer(values, f, convert=convert_dtype)
   4139 
   4140             if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input> in <lambda>(x)
      7     return float(out)
      8 
----> 9 fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
     10 fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))

<ipython-input> in value_and_wage_conversion(value)
      1 def value_and_wage_conversion(value):
      2     if isinstance(value, str):
----> 3         if 'M' in out:
      4             out = float(out.replace('M', ''))*1000000
      5         elif 'K' in value:

UnboundLocalError: local variable 'out' referenced before assignment

You were almost there, but you need to fix your function: in the 'M' branch you read out, a local variable that has never been assigned, which is exactly what the UnboundLocalError says. Work with value in both branches instead.
For example:
import numpy as np
import pandas as pd

# generate a random sample
values = ['10M', '10K', 10.5, '200M', '200K', 200]
size = 100
np.random.seed(1)
df = pd.DataFrame({
    'Value': np.random.choice(values, size),
    'Wage': np.random.choice(values, size),
})
print(df)
   Value  Wage
0    200   200
1   200M  200M
2   200K   200
3    10M   10M
4    10K  200M
..   ...   ...
95  200K   200
96   200  200M
97  10.5  200K
98  200K  10.5
99  200M   10M

[100 rows x 2 columns]
Define the function and apply it:
def value_and_wage_conversion(value):
    if isinstance(value, str):
        if 'M' in value:
            value = float(value.replace('M', ''))*1000000
        elif 'K' in value:
            value = float(value.replace('K', ''))*1000
    return float(value)

df['Value'] = df['Value'].apply(value_and_wage_conversion)
df['Wage'] = df['Wage'].apply(value_and_wage_conversion)
print(df)
          Value         Wage
0         200.0        200.0
1   200000000.0  200000000.0
2      200000.0        200.0
3    10000000.0   10000000.0
4       10000.0  200000000.0
..          ...          ...
95     200000.0        200.0
96        200.0  200000000.0
97         10.5     200000.0
98     200000.0         10.5
99  200000000.0   10000000.0

[100 rows x 2 columns]
and check the dtypes:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Value   100 non-null    float64
 1   Wage    100 non-null    float64
dtypes: float64(2)
memory usage: 1.7 KB
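As a side note, the row-by-row apply can be avoided entirely. A minimal vectorized sketch of the same conversion, assuming 'M' and 'K' are the only suffixes that ever occur in the raw string data:
for col in ['Value', 'Wage']:
    # rewrite the suffixes as scientific-notation exponents, then cast;
    # with regex=True, replace works on substrings inside each string,
    # and cells that are already numeric (10.5, 200) pass through unchanged
    df[col] = df[col].replace({'M': 'e6', 'K': 'e3'}, regex=True).astype(float)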

Related

python pandas divide dataframe in method chain

I want to divide a dataframe by a number:
df = df/10
Is there a way to do this in a method chain?
# idea:
df = df.filter(['a','b']).query("a>100").assign(**divide by 10)
We can use DataFrame.div here:
df = df[['a','b']].query("a>100").div(10)
      a    b
0  40.0  0.7
1  50.0  0.8
5  70.0  0.3
Use DataFrame.pipe with a lambda to apply a function to the whole DataFrame at once:
df = pd.DataFrame({
    'a': [400, 500, 40, 50, 5, 700],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'd': [5, 3, 6, 9, 2, 4]
})

df = df.filter(['a','b']).query("a>100").pipe(lambda x: x / 10)
print (df)
      a    b
0  40.0  0.7
1  50.0  0.8
5  70.0  0.3
If you use apply instead, each column is divided separately, because the function is called once per column:
df = df.filter(['a','b']).query("a>100").apply(lambda x: x / 10)
You can see the difference with print:
df1 = df.filter(['a','b']).query("a>100").pipe(lambda x: print (x))
     a  b
0  400  7
1  500  8
5  700  3
df2 = df.filter(['a','b']).query("a>100").apply(lambda x: print (x))
0    400
1    500
5    700
Name: a, dtype: int64
0    7
1    8
5    3
Name: b, dtype: int64
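To make the call pattern explicit, here is a small probe, a sketch that assumes df is the original six-row frame defined above: pipe receives the whole DataFrame once, while apply receives one Series per column.
calls = []

def probe(x):
    # record what kind of object pandas hands to the function
    calls.append(type(x).__name__)
    return x / 10

df.filter(['a', 'b']).query("a>100").pipe(probe)
print(calls)   # ['DataFrame']

calls.clear()
df.filter(['a', 'b']).query("a>100").apply(probe)
print(calls)   # typically ['Series', 'Series'], one call per column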

Dual sort based on condition within a groupby

[EDIT] Changed df size to 1k and provided piecemeal code for expected result.
Have the following df:
import random
import numpy as np
import pandas as pd

random.seed(1234)
sz = 1000
typ = ['a', 'b', 'c']
sub_typ = ['s1', 's2', 's3', 's4']
ifs = ['A', 'D']
col_sort = np.random.randint(0, 10, size=sz)
col_val = np.random.randint(100, 1000, size=sz)
df = pd.DataFrame({'typ': random.choices(typ, k=sz),
                   'sub_typ': random.choices(sub_typ, k=sz),
                   'col_if': random.choices(ifs, k=sz),
                   'col_sort': col_sort,
                   'value': col_val})
Would like to sort within a groupby of the [typ] and [sub_typ] fields, such that [col_sort] is sorted in ascending order where [col_if] == 'A' and in descending order where [col_if] == 'D', and then pick up the first 3 rows of the sorted dataframe, in one line of code.
Expected result is like df_result below:
df_A = df[df.col_if == 'A']
df_D = df[df.col_if == 'D']
df_A_sorted_3 = (df_A.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=True))
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_D_sorted_3 = (df_D.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=False))
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
Tried:
df.groupby(['typ', 'sub_typ']).apply(
    lambda x: x.sort_values('col_sort', ascending=True) if x.col_if == 'A'
    else x.sort_values('col_sort', ascending=False)
).groupby(['typ', 'sub_typ', 'col_sort']).head(3)
...but it gives the error:
ValueError: The truth value of a Series is ambiguous.
Sorting per group is the same as sorting by multiple columns, but to get identical output a stable sort is necessary, i.e. kind='mergesort'. So to improve performance I suggest NOT sorting per group inside groupby:
np.random.seed(1234)
sz = 1000
typ = ['a', 'b', 'c']
sub_typ = ['s1', 's2', 's3', 's4']
ifs = ['A', 'D']
col_sort = np.random.randint(0, 10, size=sz)
col_val = np.random.randint(100, 1000, size=sz)
df = pd.DataFrame({'typ': np.random.choice(typ, sz),
                   'sub_typ': np.random.choice(sub_typ, sz),
                   'col_if': np.random.choice(ifs, sz),
                   'col_sort': col_sort,
                   'value': col_val})
# print (df)
df_A = df[df.col_if == 'A']
df_D = df[df.col_if == 'D']

df_A_sorted_3 = (df_A.sort_values(['typ', 'sub_typ', 'col_sort'])
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_D_sorted_3 = (df_D.sort_values(['typ', 'sub_typ', 'col_sort'], ascending=[True, True, False])
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))

df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
print (df_result)
    typ sub_typ col_if  col_sort  value
0     a      s1      A         0    709
1     a      s1      A         0    710
2     a      s1      A         0    801
3     a      s1      A         1    542
4     a      s1      A         1    557
..  ...     ...    ...       ...    ...
646   c      s4      D         1    555
647   c      s4      D         1    233
648   c      s4      D         0    501
649   c      s4      D         0    436
650   c      s4      D         0    695

[651 rows x 5 columns]
Compare outputs:
df_A_sorted_3 = (df_A.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=True, kind='mergesort'))
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_D_sorted_3 = (df_D.groupby(['typ', 'sub_typ'], as_index=False)
                     .apply(lambda x: x.sort_values('col_sort', ascending=False, kind='mergesort'))
                     .groupby(['typ', 'sub_typ', 'col_sort'])
                     .head(3))
df_result = pd.concat([df_A_sorted_3, df_D_sorted_3]).reset_index(drop=True)
print (df_result)
    typ sub_typ col_if  col_sort  value
0     a      s1      A         0    709
1     a      s1      A         0    710
2     a      s1      A         0    801
3     a      s1      A         1    542
4     a      s1      A         1    557
..  ...     ...    ...       ...    ...
646   c      s4      D         1    555
647   c      s4      D         1    233
648   c      s4      D         0    501
649   c      s4      D         0    436
650   c      s4      D         0    695

[651 rows x 5 columns]
EDIT: Possible, but slow:
def f(x):
    a = x[x.col_if == 'A'].sort_values('col_sort', ascending=True, kind='mergesort')
    d = x[x.col_if == 'D'].sort_values('col_sort', ascending=False, kind='mergesort')
    return pd.concat([a, d], sort=False)

df_result = (df.groupby(['typ', 'sub_typ', 'col_if'], as_index=False, group_keys=False)
               .apply(f)
               .groupby(['typ', 'sub_typ', 'col_sort', 'col_if'])
               .head(3))
print (df_result)
     typ sub_typ col_if  col_sort  value
242    a      s1      A         0    709
535    a      s1      A         0    710
589    a      s1      A         0    801
111    a      s1      A         1    542
209    a      s1      A         1    557
..   ...     ...    ...       ...    ...
39     c      s4      D         1    555
211    c      s4      D         1    233
13     c      s4      D         0    501
614    c      s4      D         0    436
658    c      s4      D         0    695

[651 rows x 5 columns]
You wrote that col_if should act as a "switch" for the sort order. But note that each group (at least for your seeding of random) contains both A and D in the col_if column, so your requirement is ambiguous.
One possible solution is to perform a "majority vote" in each group, i.e. the sort order in a particular group is ascending if there are at least as many A values as D values. Note that I arbitrarily chose the ascending order in the "equal" case; maybe you should take the other option.
A doubtful point in your requirements (and hence your code) is that you put .head(3) after the group processing. This way you get the first 3 rows from the first group only. Maybe you want the 3 initial rows from each group? In that case head(3) should be inside the lambda function (as I wrote below).
So change your code to:
df.groupby(['typ', 'sub_typ']).apply(
    lambda x: x.sort_values(
        'col_sort',
        ascending=x.col_if.eq('A').sum() >= x.col_if.eq('D').sum()
    ).head(3))
As you can see, the sort order can be expressed as a single bool expression for ascending, instead of two similar expressions differing only in the ascending parameter.
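To see what that expression evaluates to, you can inspect one group by hand; a small sketch using the df built above:
# take one (typ, sub_typ) group and compute the majority-vote flag
g = df[(df.typ == 'a') & (df.sub_typ == 's1')]
asc = g.col_if.eq('A').sum() >= g.col_if.eq('D').sum()
print(asc)  # True -> this group sorts ascending, False -> descending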

I am sure that the type of "items_tmp_dic2" is dict, so why does it report this error?

import pandas as pd
import numpy as np
path = 'F:/datasets/kaggle/predict_future_sales/'
train_raw = pd.read_csv(path + 'sales_train.csv')
items = pd.read_csv(path + 'items.csv')
item_category_id = items['item_category_id']
item_id = train_raw.item_id
train_raw.head()
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0
items.head()
                                           item_name  item_id  item_category_id
0                  ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76
2                             ***В ЛУЧАХ СЛАВЫ (UNV) D        2                40
3                            ***ГОЛУБАЯ ВОЛНА (Univ) D        3                40
4                                ***КОРОБКА (СТЕКЛО) D        4                40
Then I want to add an "item_category_id" column to train_raw, taken from the data in items, so I want to create a dict mapping item_id to item_category_id:
item_category_id = items['item_category_id']
item_id = train_raw.item_id
items_tmp = items.drop(['item_name'], axis=1)
items_tmp_dic = items_tmp.to_dict('split')
items_tmp_dic = items_tmp_dic.get('data')
items_tmp_dic2 = dict(items_tmp_dic)
ic_id = []
for i in np.nditer(item_id.values[:10]):
    ic_id.append(items_tmp_dic2.get(i))
print(len(ic_id))
The error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-be637620ea6d> in <module>
      6 ic_id = []
      7 for i in np.nditer(item_id.values[:10]):
----> 8     ic_id.append(items_tmp_dic2.get(i))
      9 print(len(ic_id))

TypeError: unhashable type: 'numpy.ndarray'
but when I run
for i in np.nditer(item_id.values[:10]):
    print(i)
I get
22154
2552
2552
2554
2555
2564
2565
2572
2572
2573
I have ensured that the type of "items_tmp_dic2" is dict, so why?
I have solved it by using int(). The reason for the error is that np.nditer yields zero-dimensional numpy arrays rather than Python ints, and an ndarray is unhashable, so it cannot be used as a dict key:
for i in np.nditer(item_id.values[:10]):
    ic_id.append(items_tmp_dic2.get(int(i)))
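As a side note, the detour through a dict is not needed at all. A more pandas-native sketch, assuming item_id is unique in items:
# build a lookup Series indexed by item_id, then map it onto the
# item_id column of train_raw; map aligns by value, no loop needed
id_to_cat = items.set_index('item_id')['item_category_id']
train_raw['item_category_id'] = train_raw['item_id'].map(id_to_cat)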

df.loc[rows, [col]] vs df.loc[rows, col] in assignment

Why do the following assignments behave differently?
df.loc[rows, [col]] = ...
df.loc[rows, col] = ...
For example:
r = pd.DataFrame({"response": [1, 1, 1]}, index=[1, 2, 3])
df = pd.DataFrame({"x": [999, 99, 9]}, index=[3, 4, 5])
df = pd.merge(df, r, how="left", left_index=True, right_index=True)
df.loc[df["response"].isnull(), "response"] = 0
print(df)
     x  response
3  999       0.0
4   99       0.0
5    9       0.0
but
df.loc[df["response"].isnull(), ["response"]] = 0
print(df)
     x  response
3  999       1.0
4   99       0.0
5    9       0.0
Why should I expect the first to behave differently from the second?
df.loc[df["response"].isnull(), ["response"]]
returns a DataFrame, so if you want to assign something to it, the assigned value must be aligned by both index and columns.
Demo:
In [79]: df.loc[df["response"].isnull(), ["response"]] = \
    ...:     pd.DataFrame([11, 12], columns=['response'], index=[4, 5])

In [80]: df
Out[80]:
     x  response
3  999       1.0
4   99      11.0
5    9      12.0
Alternatively, you can assign an array/matrix of the same shape:
In [83]: df.loc[df["response"].isnull(), ["response"]] = [11, 12]

In [84]: df
Out[84]:
     x  response
3  999       1.0
4   99      11.0
5    9      12.0
I'd also consider using the fillna() method:
In [88]: df.response = df.response.fillna(0)

In [89]: df
Out[89]:
     x  response
3  999       1.0
4   99       0.0
5    9       0.0
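A quick way to see why the two spellings differ, sketched with the df from the question: the scalar label selects a Series, the list of labels selects a DataFrame, and a scalar assigned to a Series broadcasts, while a value assigned to a DataFrame is aligned first.
mask = df["response"].isnull()
print(type(df.loc[mask, "response"]))    # <class 'pandas.core.series.Series'>
print(type(df.loc[mask, ["response"]]))  # <class 'pandas.core.frame.DataFrame'>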

Pandas custom file format

I have a huge Pandas DataFrame that I need to write away to a format that RankLib can understand. Example with a target, a query ID and 3 features is this:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
I have written my own function that iterates over the rows and writes them away like this:
data_file = open(filename, 'w')
for index, row in data.iterrows():
line = str(row['score'])
line += ' qid:'+str(row['srch_id'])
counter = 0
for feature in feature_columns:
counter += 1
line += ' '+str(counter)+':'+str(row[feature])
data_file.write(line+'\n')
data_file.close()
Since I have about 200 features and 5m rows this is obviously very slow. Is there a better approach using the I/O of Pandas itself?
You can do it this way:
Data:
In [155]: df
Out[155]:
   f1   f2  f3 score  srch_id
0  12  0.6  13     5        4
1   8  0.4  11     1        4
2  11  0.7  14     2       10

In [156]: df.dtypes
Out[156]:
f1           int64
f2         float64
f3           int64
score       object
srch_id      int64
dtype: object
Solution:
feature_columns = ['f1', 'f2', 'f3']
cols2id = {col: str(i+1) for i, col in enumerate(feature_columns)}

def f(x):
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'srch_id':
        return 'qid:' + x.astype(str)
    else:
        return x

(df.apply(lambda x: f(x))[['score', 'srch_id'] + feature_columns]
   .to_csv('d:/temp/out.csv', sep=' ', index=False, header=None)
)
out.csv:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
2 qid:10 1:11 2:0.7 3:14
cols2id helper dict:
In [158]: cols2id
Out[158]: {'f1': '1', 'f2': '2', 'f3': '3'}
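As a side note, the same file can be built without apply by concatenating string Series column by column; a sketch, assuming the demo columns above:
# start from the score and query id, then append each 'i:value' piece;
# every operation here is a vectorized string concatenation
out = df['score'].astype(str) + ' qid:' + df['srch_id'].astype(str)
for i, col in enumerate(feature_columns, start=1):
    out = out + ' ' + str(i) + ':' + df[col].astype(str)

# each element of `out` is already a full line
with open('out.txt', 'w') as fh:
    fh.write('\n'.join(out) + '\n')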