Series of if statements applied to data frame - pandas

I have a question about how to do this task. I want to group a series of numbers in my data frame. The numbers come from the column 'PD', which ranges from .001 to 1. What I want to do is map those with .9 < 'PD' < .91 to .91 (i.e. return a value of .91), those with .91 <= 'PD' < .92 to .92, ..., and those with .99 <= 'PD' <= 1 to 1, in a column named 'Grouping'. What I have been doing is writing each if statement manually and then merging the result with the base data frame. Can anyone help me with a more efficient way of doing this? I'm still in the early stages of using Python. Sorry if the question seems easy. Thank you for your time.

Say your data looks like this:
>>> df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001), 'data': np.random.randint(10, size=999)})
>>> df.head()
PD data
0 0.001 6
1 0.002 3
2 0.003 5
3 0.004 9
4 0.005 7
Then cut off the last decimal of the PD column. This is a bit tricky, since you run into a lot of rounding issues if you do it without converting to str first. E.g.
>>> df['PD'] = df['PD'].apply(lambda x: float('{:.3f}'.format(x)[:-1]))
>>> df.tail()
PD data
994 0.99 1
995 0.99 3
996 0.99 2
997 0.99 1
998 0.99 0
Now you can use the pandas-groupby. Do with data whatever you want, e.g.
>>> df.groupby('PD').agg(lambda x: ','.join(map(str, x)))
data
PD
0.00 6,3,5,9,7,3,6,8,4
0.01 3,5,7,0,4,9,7,1,7,1
0.02 0,0,9,1,5,4,1,6,7,3
0.03 4,4,6,4,6,5,4,4,2,1
0.04 8,3,1,4,6,5,0,6,0,5
[...]
Note that the first row is one item shorter due to missing 0.000 in my sample.
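Truncating the last decimal floors the values (0.905 ends up in the 0.90 group). If the goal from the question is instead to round each PD up to the next hundredth, a minimal sketch of that variant (my own, not part of the answer above) is:
import numpy as np
import pandas as pd
df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001), 'data': np.random.randint(10, size=999)})
# Round up to the next hundredth: 0.901..0.91 -> 0.91, 0.911..0.92 -> 0.92, ...
# The inner round(6) guards against float artifacts such as 0.91 * 100 == 91.00000000000001.
df['Grouping'] = np.ceil((df['PD'] * 100).round(6)) / 100
Exact hundredths map to themselves here; if the half-open boundaries from the question matter (0.91 itself going to the 0.92 group), pd.cut with explicit bin edges gives control over which side of each boundary is closed.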

Related

Pandas: take the minimum of two operations on two dataframes, while preserving index

I'm a beginner with Pandas. I've got two dataframes df1 and df2 of three columns each, labelled by some index.
I would like to get a third dataframe whose entries are
min( df1-df2, 1-df1-df2 )
for each column, while preserving the index.
I don't know how to do this on all the three columns at once. If I try e.g. np.min( df1-df2, 1-df1-df2 ) I get TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed, whereas min( df1-df2, 1-df1+df2 ) gives ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I can't use apply because I've got more than one dataframe. Basically, I would like to use something like subtract, but with the ability to define my own function.
Example: consider these two dataframes
df0 = pd.DataFrame( [[0.1,0.2,0.3], [0.3, 0.1, 0.2], [0.1, 0.3, 0.9]], index=[2,1,3], columns=['px', 'py', 'pz'] )
In [4]: df0
Out[4]:
px py pz
2 0.1 0.2 0.3
1 0.3 0.1 0.2
3 0.1 0.3 0.9
and
df1 = pd.DataFrame( [[0.9,0.1,0.9], [0.1,0.2,0.1], [0.3,0.1,0.8]], index=[3,1,2], columns=['px', 'py', 'pz'])
px py pz
3 0.9 0.1 0.9
1 0.1 0.2 0.1
2 0.3 0.1 0.8
my desired output is a new dataframe df, made up of three columns 'px', 'py', 'pz', whose entries are:
for j in range(1,4):
    dfx[j-1] = min( df0['px'][j] - df1['px'][j], 1 - df0['px'][j] + df1['px'][j] )
for df['px'], and similarly for 'py' and 'pz'.
px py pz
1 0.2 -0.1 0.1
2 -0.2 0.1 -0.5
3 -0.8 0.2 0.0
I hope it's clear now! Thanks in advance!
pandas is smart enough to match up the columns and index values for you in a vectorized way. If you're looping a dataframe, you're probably doing it wrong.
m1 = df0 - df1
m2 = 1 - (df0 + df1)
# Take the values from m1 where they're less than
# the corresponding value in m2. Otherwise, take m2:
out = m1[m1.lt(m2)].combine_first(m2)
# Another method: Combine our two calculated frames,
# groupby the index, and take the minimum.
out = pd.concat([m1, m2]).groupby(level=0).min()
print(out)
# Output:
px py pz
1 0.2 -0.1 0.1
2 -0.2 0.1 -0.5
3 -0.8 0.2 -0.8
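A third variant (my own sketch, not from the answer above): since m1 and m2 already share the same aligned index and columns after the arithmetic, NumPy's element-wise minimum can be applied to them directly.
import numpy as np
m1 = df0 - df1            # pandas aligns on index and columns here
m2 = 1 - (df0 + df1)
out = np.minimum(m1, m2)  # element-wise minimum of the two aligned frames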

Change data type from object to string in pandas

This seems very simple but I just can't figure it out.
I have a dataframe with an amount column in GB pence, euros and US dollars. I need a new column with the amount in GBP; the only way I can think of doing this is to first convert the amount field to string, separate the amount from the currency sign, and then use numpy to perform a conditional calculation.
df = pd.DataFrame({'date': ['2018-11-22','2018-11-23','2018-11-24'],
                   'amount': ['3.80p','$4.50','\N{euro sign}3.40'],
                   'usd-gbp': ['0.82','0.83','0.84'],
                   'eur-gbp': ['0.91','0.92','0.93']})
I am trying to convert the amount column to string so I can extract the currency and float amount, but it just puts the same string (the repr of the whole column) into every row.
df['new'] = str(df['amount'])
Expected output would just be the amount values in string format so I can perform slicing on them.
Any help would be appreciated. Thanks.
You can use replace to swap each currency symbol for a multiplier ('0.82*', '0.91*', or '*1' for pence) and then evaluate the resulting expressions with pd.eval:
to_gbp = {'p': '*1', '\$': '0.82*', '€': '0.91*'}
df['gbp'] = pd.eval(df['amount'].replace(to_gbp, regex=True))
print(df)
# Output:
amount usd-gbp eur-gbp gbp
0 3.80p 0.82 0.91 3.800
1 $4.50 0.82 0.91 3.690
2 €3.40 0.82 0.91 3.094
Detail about replace:
>>> df['amount'].replace(to_gbp, regex=True)
0 3.80*1
1 0.82*4.50
2 0.91*3.40
Name: amount, dtype: object
Update
I didn't mention that the exchange rates differ based on the date. I have updated the question to show this. Is this still possible with 'replace'?
Create a custom function and apply it to each row:
def convert(row):
    d = {'$': 'usd-gbp', '€': 'eur-gbp'}
    c = row['amount'][0]
    return float(row['amount'][1:]) * float(row[d[c]]) \
        if c in d else float(row['amount'][:-1])
df['gbp'] = df.apply(convert, axis=1)
print(df)
# Output:
date amount usd-gbp eur-gbp gbp
0 2018-11-22 3.80p 0.82 0.91 3.800
1 2018-11-23 $4.50 0.83 0.92 3.735
2 2018-11-24 €3.40 0.84 0.93 3.162
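If apply becomes a bottleneck on larger frames, a vectorized sketch along the same lines (my own variant; it assumes every amount either starts with '$'/'€' or ends with 'p') could look like this:
import numpy as np
# Strip the currency marker and keep the numeric part.
num = df['amount'].str.strip('p$€').astype(float)
sym = df['amount'].str[0]
# Pick the matching daily rate per row; pence needs no conversion.
rate = np.select([sym.eq('$'), sym.eq('€')],
                 [df['usd-gbp'].astype(float), df['eur-gbp'].astype(float)],
                 default=1.0)
df['gbp'] = num * rate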

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the nans with averages of columns where they are?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the NaNs directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
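For illustration, a minimal sketch of that dict variant (same result as passing the Series in current pandas versions):
# Build a {column: mean} mapping and pass it to fillna.
filled = df.fillna(df.mean().to_dict())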
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply, per column, the mean of that column and fill:
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit once you deal with a DataFrame of 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() over the whole DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), where I ran it selectively, i.e. only on the variables which had NaN values:
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:   #---Applying only on variables with NaN values
    df[i].fillna(df[i].mean(), inplace=True)
#---df.isnull().any(axis=0) gives a True/False flag (Boolean value Series),
#---which, when applied on df.columns[], helps identify variables with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
Apologies for the long answer! Hope this helps!
If you want to impute missing values with the mean and go column by column, this will only impute with the mean of that column. It might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
# To read data from csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values
# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
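As a small aside (my note, not part of the answer): the fit and transform steps can be collapsed into a single fit_transform call, reusing X from the snippet above:
# One-step equivalent of the fit/transform pair above.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])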
Directly use df.fillna(df.mean()) to fill all the null values with the column means.
If you want to fill the null values of a single column with the mean of that column (here Item_Weight is the column name), assign the filled column back to itself:
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
If you want to fill the null values with some string instead (here Outlet_Size is the column name):
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean: if you have outliers, it is more advisable to use the median.
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than the previous answers for the mean, but it could be shorter if you want to replace nulls with some other column function.
Using the sklearn imputer class:
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])
Note: in recent versions the missing_values parameter takes np.nan instead of the string 'NaN', and SimpleImputer has no axis argument (the older sklearn.preprocessing.Imputer did); it imputes column-wise by default.
I use this method to fill missing values with the average of a column.
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to get the most frequent values. This would work on different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
See the pandas value_counts API reference for details.

How to use indexing by matching strings in data frame in pandas

I am trying to solve the following problem. I have two data sets, say df1 and df2:
df1
NameSP Val Char1 BVA
0 'ACCR' 0.091941 A Y'
1 'SDRE' 0.001395 S Y'
2 'ACUZ' 0.121183 A N'
3 'SRRE' 0.001512 S N'
4 'FFTR' 0.035609 F N'
5 'STZE' 0.000637 S N'
6 'AHZR' 0.001418 A Y'
7 'DEES' 0.000876 D N'
8 'UURR' 0.023878 U Y'
9 'LLOH' 0.004371 L Y'
10 'IUUT' 0.049102 I N'
df2
NameSP Val1 Glob
0 'ACCR' 0.234 20000
1 'FFTR' 0.222 10000
2 'STZE' 0.001 5000
3 'DEES' 0.006 2000
4 'UURR' 0.134 20000
5 'LLOH' 0.034 10000
I would like to perform indexing of df2 in df1, and then use the indexing vector for various matrix operations. This would be something similar to strmatch(A,B,'exact') in Matlab. I can get the indexing properly by using .iloc and then .isin, as in the following code:
import pandas as pd
import numpy as np
df1 = pd.read_excel('C:\PYTHONCODES\LINEAROPT\TEST_DATA1.xlsx')
df2 = pd.read_excel('C:\PYTHONCODES\LINEAROPT\TEST_DATA2.xlsx')
print(df1)
print(df2)
ddf1 = df1.iloc[:,0]
ddf2 = df2.iloc[:,0]
pindex = ddf1[ddf1.isin(ddf2)]
print(pindex.index)
which gives me:
Int64Index([0, 4, 5, 7, 8, 9], dtype='int64')
But I cannot find a way to use this index for mapping and building my arrays. As an example, I would like to have a vector with the same number of elements as df1, but with the Val1 values from df2 at the indexed positions and zeros everywhere else. So it should look like this:
0.234
0
0
0
0.222
0.001
0
0.006
0.134
0.034
0
Or another mapping problem: how can I use such indexing to map the values from column "Val" in df1 into a vector that contains Val from df1 at the indexed rows and zeros everywhere else? This time it should look like:
0.091941
0.0
0.0
0.0
0.035609
0.000637
0.0
0.000876
0.023878
0.004371
0.0
Any idea of how to do that in an efficient and elegant way?
Thanks for help!
First problem
df2.set_index('NameSP')['Val1'].reindex(df1['NameSP']).fillna(0)
Second problem
df1['Val'].where(df1['NameSP'].isin(df2['NameSP']), 0)
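Put together as a small self-contained sketch (using inline frames with the sample values instead of the Excel files, so the two one-liners above can be run as-is):
import pandas as pd
df1 = pd.DataFrame({'NameSP': ['ACCR', 'SDRE', 'ACUZ', 'SRRE', 'FFTR', 'STZE',
                               'AHZR', 'DEES', 'UURR', 'LLOH', 'IUUT'],
                    'Val': [0.091941, 0.001395, 0.121183, 0.001512, 0.035609, 0.000637,
                            0.001418, 0.000876, 0.023878, 0.004371, 0.049102]})
df2 = pd.DataFrame({'NameSP': ['ACCR', 'FFTR', 'STZE', 'DEES', 'UURR', 'LLOH'],
                    'Val1': [0.234, 0.222, 0.001, 0.006, 0.134, 0.034]})
# First problem: Val1 from df2 aligned to df1's order, zeros where there is no match.
v1 = df2.set_index('NameSP')['Val1'].reindex(df1['NameSP']).fillna(0)
# Second problem: Val from df1 where NameSP appears in df2, zeros elsewhere.
v2 = df1['Val'].where(df1['NameSP'].isin(df2['NameSP']), 0)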

Why does a table I create from a Pandas dataframe have extra decimal places?

I have a dataframe which I want to use to create a table. The dataframe contains data in float form which I have already rounded to two decimal places. Here is a dataframe which is a subset of the dataframe I am working on:
Band R^2
0 Band2 Train 0.37
1 Band3 Train 0.50
2 Band4 Train 0.19
3 Band2 Test 0.41
4 Band3 Test 0.53
5 Band4 Test 0.12
As you can see all data in the R^2 column are rounded.
I have written the following simple code to create a table which I intend to export as a png so I can embed it in a LaTeX document. Here is the code:
import matplotlib.pyplot as plt
from pandas.plotting import table

ax1 = plt.subplot(111, frameon=False)
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)
ax1.set_frame_on(False)
myTable = table(ax1, df)
myTable.auto_set_font_size(False)
myTable.set_fontsize(13)
myTable.scale(1.2, 3.5)
Here is the table:
Can anybody explain why the values in the R^2 column are displayed with more than two decimal places?
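A common workaround (a sketch, not a definitive diagnosis, and it assumes the column really is named 'R^2' as in the excerpt): the table is built from the underlying float values rather than from pandas' two-decimal display, so formatting the column as fixed two-decimal strings before calling table() guarantees the cells show exactly two decimals.
# Pre-format the R^2 column as strings so the table cells show exactly two decimals.
df_fmt = df.copy()
df_fmt['R^2'] = df_fmt['R^2'].map('{:.2f}'.format)
myTable = table(ax1, df_fmt)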