Pythonic style of writing a "for-loop" with "if" clause - pandas

I come from Java and I'm new to Python.
I have the following code snippet:
count_of_yes = 0
for str_idx in str_indexes:  # e.g. ["abc", "bbb", "cb", "aaa"]
    if "a" in str_idx:
        count_of_yes += one_dict["data_frame_of_interest"].loc[str_idx, 'yes_column']
The one_dict looks like:
# categorical, can only be 1 or 0 in either column
one_dict --> data_frame_of_interest --> ______|__no_column__|__yes_column__
                                        "abc" |     1.0     |     0.0
                                        "cb"  |     1.0     |     0.0
                                        "aaab"|     0.0     |     1.0
                                        "bb"  |     0.0     |     1.0
                                        ...
         --> other_dfs_dont_need --> ...
         ...
I'm trying to get count_of_yes. Is there a more Pythonic way to refactor the above for-loop and calculate the sum?
Thanks!
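A minimal sketch of one possible refactor (assuming every label in str_indexes exists in the frame's index): the loop collapses either into a generator expression inside sum(), or into a single vectorized .loc lookup:
df = one_dict["data_frame_of_interest"]

# closest to the original loop: generator expression inside sum()
count_of_yes = sum(df.loc[s, 'yes_column'] for s in str_indexes if "a" in s)

# or fully vectorized: select all matching labels at once and sum the column
matching = [s for s in str_indexes if "a" in s]
count_of_yes = df.loc[matching, 'yes_column'].sum()
The vectorized version does one indexing operation instead of one lookup per string, which is usually both faster and more idiomatic in pandas.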

Related

Dealing with multiple values in Pandas Dataframe Cell

The columns describe the data and the rows hold the values. However, some columns contain multiple values (a tabular form on the website). The rows of that sub-table get merged into one cell and are separated by hashtags. Since they are only part of the sub-table, they refer to other columns whose cell values are also separated by hashtags.
solution_id | type labour | labour_unit   | est_labour_quantity | est_labour_costs | est_labour_total_costs
10          | WorkA#WorkB | Person#Person | 2.0#2.0             | 300.0#300.0.     | 600.0#600.0
11          | WorkC#WorkD | Person#Person | 3.0#2.0             | 300.0#300.0.     | 900.0#600.0
My questions are twofold:
What would be a good way to transform the data so it is easier to work with, e.g. create as many new columns as there are entries in one cell? So, for example, separate it like this:
solution_id | type labour_1 | labour_unit_1 | est_labour_quantity_1 | est_labour_costs_1 | est_labour_total_costs_1 | type labour_2 | labour_unit_2 | est_labour_quantity_2 | est_labour_costs_2 | est_labour_total_costs_2
10          | WorkA         | Person        | 2.0                   | 300.0              | 600.0                    | WorkB         | Person        | 2.0                   | 300.0              | 600.0
11          | WorkC         | Person        | 3.0                   | 300.0              | 900.0                    | WorkD         | Person        | 2.0                   | 300.0              | 600.0
This makes it more readable, but it doubles the number of columns, and I have some cells with up to 5 entries, so it would be 5x as many columns. What I also don't like so much about the idea is that the new column names are not really meaningful and will be hard to interpret.
How can I make this separation in pandas, so that I have WorkA and then the associated values, then WorkB, etc.?
If there is another, better way to work with this tabular form (maybe keeping it all in one cell?), please let me know!
Use:
# unpivot by melt
df = df.melt('solution_id')
# create lists by splitting on '#'
df['value'] = df['value'].str.split('#')
# repeat rows by the value column
df = df.explode('value')
# counter for new column names
df['g'] = df.groupby(['solution_id', 'variable']).cumcount().add(1)
# pivot and sort the MultiIndex
df = (df.pivot(index='solution_id', columns=['variable', 'g'], values='value')
        .sort_index(level=1, axis=1, sort_remaining=False))
# flatten the MultiIndex
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df)
type_labour_1 labour_unit_1 est_labour_quantity_1 \
solution_id
10 WorkA Person 2.0
11 WorkC Person 3.0
est_labour_costs_1 est_labour_total_costs_1 type_labour_2 \
solution_id
10 300.0 600.0 WorkB
11 300.0 900.0 WorkD
labour_unit_2 est_labour_quantity_2 est_labour_costs_2 \
solution_id
10 Person 2.0 300.0.
11 Person 2.0 300.0.
est_labour_total_costs_2
solution_id
10 600.0
11 600.0
You can split your strings, explode and reshape:
df2 = (df
       .set_index('solution_id')
       .apply(lambda c: c.str.split('#'))
       .explode(list(df.columns[1:]))
       .assign(idx=lambda d: d.groupby(level=0).cumcount().add(1))
       .set_index('idx', append=True)
       .unstack('idx')
       .sort_index(axis=1, level='idx', sort_remaining=False)
      )
df2.columns = [f'{a}_{b}' for a,b in df2.columns]
output:
type labour_1 labour_unit_1 est_labour_quantity_1 est_labour_costs_1 est_labour_total_costs_1 type labour_2 labour_unit_2 est_labour_quantity_2 est_labour_costs_2 est_labour_total_costs_2
solution_id
10 WorkA Person 2.0 300.0 600.0 WorkB Person 2.0 300.0. 600.0
11 WorkC Person 3.0 300.0 900.0 WorkD Person 2.0 300.0. 600.0
Or, shorter code using the same initial split followed by slicing and concatenation:
df2 = (df
       .set_index('solution_id')
       .apply(lambda c: c.str.split('#'))
      )
pd.concat([df2.apply(lambda c: c.str[i]).add_suffix(f'_{i+1}')
           for i in range(len(df2.iat[0, 0]))], axis=1)
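Regarding the third part of the question (whether another representation might be easier to work with): a minimal sketch of a long/tidy format, assuming the same split-and-explode logic as above; each row then holds exactly one labour entry per solution_id, so no numbered column suffixes are needed:
long_df = (df
           .set_index('solution_id')
           .apply(lambda c: c.str.split('#'))    # split every '#'-separated cell into a list
           .explode(list(df.columns[1:]))        # one row per labour entry
           .reset_index())
This keeps the column names meaningful no matter how many entries a cell holds, at the cost of repeating solution_id across rows.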

How come apply on multiple columns in dataframe does not work?

I am trying to remove the '$' sign and convert the values to floats for multiple columns in a dataframe.
I have a dataframe that looks something like this:
policy_status sum_assured premium riders premium_plus
0 A 1252000 $ 1500 $ 1.0 1100 $
1 A 1072000 $ 2200 $ 2.0 1600 $
2 A 1274000 $ 1700 $ 2.0 1300 $
3 A 1720000 $ 2900 $ 1.0 1400 $
4 A 1360000 $ 1700 $ 3.0 1400 $
I have this function:
def transform_amount(x):
    x = x.replace('$', '')
    x = float(x)
    return x
When I do this:
policy[['sum_assured','premium','premium_plus']]=policy[['sum_assured','premium','premium_plus']].apply(transform_amount)
the following error occurred:
TypeError: ("cannot convert the series to <class 'float'>", 'occurred at index sum_assured')
Does anyone know why?
If you need it to work elementwise, use DataFrame.applymap:
cols = ['sum_assured', 'premium', 'premium_plus']
policy[cols] = policy[cols].applymap(transform_amount)
Better is to use DataFrame.replace with regex=True, but first escape the $ (it is a special regex character) and then convert the columns to floats:
cols = ['sum_assured', 'premium', 'premium_plus']
policy[cols] = policy[cols].replace(r'\$', '', regex=True).astype(float)
print(policy)
policy_status sum_assured premium riders premium_plus
0 A 1252000.0 1500.0 1.0 1100.0
1 A 1072000.0 2200.0 2.0 1600.0
2 A 1274000.0 1700.0 2.0 1300.0
3 A 1720000.0 2900.0 1.0 1400.0
4 A 1360000.0 1700.0 3.0 1400.0
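For context on the original error: DataFrame.apply passes each whole column (a Series) to the function, and float() cannot convert a Series, hence the TypeError. If you prefer to keep apply, a minimal sketch of a Series-level version of the function (column names assumed to match the question's frame):
def transform_amount(col):
    # operate on the whole column at once instead of element by element
    return col.str.replace('$', '', regex=False).str.strip().astype(float)

cols = ['sum_assured', 'premium', 'premium_plus']
policy[cols] = policy[cols].apply(transform_amount)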

Calculate nearest distance to certain points in python

I have a dataset as shown below; each sample has x and y values and the corresponding result:
Sr.  X  Y   Result
1    2  12  Positive
2    4  3   Positive
....
Visualization
Grid size is 12 * 8
How can I calculate, for each sample, the nearest distance to the red points (positive ones)?
Red = Positive,
Blue = Negative
Sr. X Y Result Nearest-distance-red
1 2 23 Positive ?
2 4 3 Negative ?
....
It's a lot easier when there is sample data; make sure to include that next time.
I generated some random data:
import numpy as np
import pandas as pd
import sklearn
x = np.linspace(1,50)
y = np.linspace(1,50)
GRID = np.meshgrid(x,y)
grid_colors = 1* ( np.random.random(GRID[0].size) > .8 )
sample_data = pd.DataFrame( {'X': GRID[0].flatten(), 'Y':GRID[1].flatten(), 'grid_color' : grid_colors})
sample_data.plot.scatter(x="X",y='Y', c='grid_color', colormap='bwr', figsize=(10,10))
BallTree (or KDTree) can create a tree to query with
from sklearn.neighbors import BallTree
red_points = sample_data[sample_data.grid_color == 1]
blue_points = sample_data[sample_data.grid_color != 1]
tree = BallTree(red_points[['X','Y']], leaf_size=15, metric='minkowski')
and use it with
distance, index = tree.query(sample_data[['X','Y']], k=1)
now add it to the DataFrame
sample_data['nearest_point_distance'] = distance
sample_data['nearest_point_X'] = red_points.X.values[index]
sample_data['nearest_point_Y'] = red_points.Y.values[index]
which gives
X Y grid_color nearest_point_distance nearest_point_X \
0 1.0 1.0 0 2.0 3.0
1 2.0 1.0 0 1.0 3.0
2 3.0 1.0 1 0.0 3.0
3 4.0 1.0 0 1.0 3.0
4 5.0 1.0 1 0.0 5.0
nearest_point_Y
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
Modification to have red points not find themselves:
find the nearest k=2 instead of k=1;
distance, index = tree.query(sample_data[['X','Y']], k=2)
and, with the help of numpy indexing, make red points use the second-nearest point instead of the first found:
sample_size = GRID[0].size
sample_data['nearest_point_distance'] = distance[np.arange(sample_size),sample_data.grid_color]
sample_data['nearest_point_X'] = red_points.X.values[index[np.arange(sample_size),sample_data.grid_color]]
sample_data['nearest_point_Y'] = red_points.Y.values[index[np.arange(sample_size),sample_data.grid_color]]
The output format is the same, but due to randomness it won't agree with the picture made earlier.
cKDTree from scipy.spatial can calculate that distance for you. Something along these lines should work (query returns both distances and indices, so unpack them):
distances, _ = cKDTree(coordinates_of_red_points).query(df[['x', 'y']], k=1)
df['Distance_To_Red'] = distances
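A minimal end-to-end sketch of that idea, using a small made-up frame with the question's X, Y and Result columns (the values here are only illustrative):
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame({'X': [2, 4, 7, 9],
                   'Y': [12, 3, 5, 1],
                   'Result': ['Positive', 'Negative', 'Positive', 'Negative']})

# build the tree from the positive (red) points only
red_coords = df.loc[df['Result'] == 'Positive', ['X', 'Y']].to_numpy()
tree = cKDTree(red_coords)

# distance from every sample to its nearest red point
# (red points report 0.0 because their nearest red point is themselves)
distances, _ = tree.query(df[['X', 'Y']].to_numpy(), k=1)
df['Nearest-distance-red'] = distances
print(df)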

remove redundant signals in pandas

I want to build a correspondence between two columns (label1 and label2) following a certain rule.
label1 is like an on switch and label2 is like an off switch. Once label1 is on, further label1 signals will not re-open the switch until it is switched off by label2. Then label1 can switch on again.
For example, I have a following table:
index  label1  label2  note
1      F       T       label2 is invalid because not switched on yet
2      T       F       label1 switches on
3      F       F
4      T       F       useless action because it's on already
5      F       T       switch off
6      F       F
7      T       F       switch on
8      F       F
9      F       T       switch off
10     F       F
11     F       T       invalid off operation, not on
The correct output is something like:
label1ix label2ix
2 5
7 9
What I tried is:
df['label2ix'] = df.loc[df.label2 == 'T', 'index']  # find the label2 == True index
df['label2ix'].bfill(inplace=True)                  # backfill the column
mask = (df['label1'] == 'T')                        # label1 == True, then get the index and label2ix
newdf = pd.DataFrame(df.loc[mask, ['index', 'label2ix']])
This is not correct, because what I got is:
label1ix label2ix note
2 5 correct
4 5 wrong operation
7 9 correct
I am not sure how to filter out row 4.
I have got another idea:
df['label2ix'] = df.loc[df.label2 == 'T', 'index']  # find the label2 == True index
df['label2ix'].bfill(inplace=True)                  # backfill the column
groups = df.groupby('label2ix')
firstlabel1 = groups['label1'].first()
But for this solution, I don't know how to get the first label1 == 'T' in each group.
And I am not sure if there is any more efficient way to do that? Grouping is usually slow.
Not tested yet, but here are a few things you can try:
Option 1: For the first approach, you can filter out the 4 by:
newdf.groupby('label2ix').min()
but this approach might not work with more general data.
Option 2: This might work better in general:
# copy all on and off switches to a common column
# 0 - off, 1 - on
df['state'] = np.select([df.label1=='T', df.label2=='T'], [1,0], default=np.nan)
# ffill will fill the na with the state before it
# until changed by a new switch
df['state'] = df['state'].ffill().fillna(0)
# mark the changes of states
df['change'] = df['state'].diff()
At this point, df will be:
index label1 label2 state change
0 1 F T 0.0 NaN
1 2 T F 1.0 1.0
2 3 F F 1.0 0.0
3 4 T F 1.0 0.0
4 5 F T 0.0 -1.0
5 6 F F 0.0 0.0
6 7 T F 1.0 1.0
7 8 F F 1.0 0.0
8 9 F T 0.0 -1.0
9 10 F F 0.0 0.0
10 11 F T 0.0 0.0
which makes it easy to track all the state changes:
switch_ons = df.loc[df['change'].eq(1), 'index']
switch_offs = df.loc[df['change'].eq(-1), 'index']
# return df
new_df = pd.DataFrame({'label1ix': switch_ons.values,
                       'label2ix': switch_offs.values})
and output:
label1ix label2ix
0 2 5
1 7 9
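For the sub-question about getting the first label1 == 'T' in each group (the question's second idea), a minimal sketch that reuses the backfilled label2ix column from the question's own attempt:
on_rows = df[df['label1'] == 'T']                  # only rows that try to switch on
first_on = on_rows.groupby('label2ix', as_index=False).first()
new_df = (first_on[['index', 'label2ix']]
          .rename(columns={'index': 'label1ix'}))
Taking only the first on-row per off-index drops the redundant row 4 automatically, and on-rows after the final off (with NaN label2ix) are excluded because groupby skips NaN keys by default.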

pandas - how to convert all columns from object to float type

I'm trying to convert all columns with '$' amounts from object to float type.
With the code below I couldn't remove the $ sign.
input:
df[:] = df[df.columns.map(lambda x: x.lstrip('$'))]
You can use extract:
df = pd.DataFrame({'A': ['$10.00', '$10.00', '$10.00']})
df.apply(lambda x: x.str.extract(r'(\d+)', expand=False).astype(float))
Out[333]:
A
0 10.0
1 10.0
2 10.0
Update
df.iloc[:, 9:32] = df.iloc[:, 9:32].apply(lambda x: x.str.extract(r'(\d+)', expand=False).astype(float))
Maybe you can also try using applymap:
df[:] = df.astype(str).applymap(lambda x: x.lstrip('$')).astype(float)
If df is:
0 1 2
0 $1 7 5
1 $2 7 9
2 $3 7 9
Then, it will result in:
0 1 2
0 1.0 7.0 5.0
1 2.0 7.0 9.0
2 3.0 7.0 9.0
Please use the below regular-expression-based matching to replace all occurrences of $ with an empty string:
df = df.replace({r'\$': ''}, regex=True)
UPDATE: As per @Wen's suggestion, the solution will be:
df.iloc[:, 9:32] = df.iloc[:, 9:32].replace({r'\$': ''}, regex=True).astype(float)
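If the goal is really every column that contains a '$' (rather than a hard-coded position range like iloc[:, 9:32]), a minimal sketch that detects those columns first; treating '$'-containing object columns as the target set is an assumption:
# pick the object columns whose string values contain a '$' anywhere
dollar_cols = [c for c in df.columns
               if df[c].dtype == object and df[c].astype(str).str.contains(r'\$').any()]

df[dollar_cols] = (df[dollar_cols]
                   .replace({r'\$': ''}, regex=True)
                   .astype(float))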