Iterating over rows and columns in Pandas - pandas

I am trying to replace every NaN in each column with the mean of that column.
import numpy as np
import pandas as pd
table = pd.DataFrame({'A':[1,2,np.nan],
                      'B':[3,np.nan, np.nan],
                      'C':[4,5,6]})
def impute_missing_values(table):
    for column in table:
        for value in column:
            if value == 'NaN':
                value = column.mean(skipna=True)
            else:
                value = value
impute_missing_values(table)
table
Why am I getting an error with this code?

IIUC:
table.fillna(table.mean())
Output:
     A    B  C
0  1.0  3.0  4
1  2.0  3.0  5
2  1.5  3.0  6
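As far as can be told from the posted code, the loop in the question does not do what it intends for three reasons: iterating over a DataFrame yields column labels rather than column data, NaN never compares equal to the string 'NaN', and assigning to the loop variable does not write back into the frame. The following is a minimal sketch of those three points next to the vectorized fix; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [3, np.nan, np.nan],
                      'C': [4, 5, 6]})

# Iterating over a DataFrame yields the column labels, not the column data.
labels = [column for column in table]

# NaN is a float, not the string 'NaN', so this comparison never matches.
nan_equals_string = bool((table['A'] == 'NaN').any())

# isna() is the reliable way to detect missing values.
nan_count = int(table['A'].isna().sum())

# table.mean() is a Series of per-column means; fillna aligns it by column label.
filled = table.fillna(table.mean())
print(filled)
```

Note that fillna returns a new frame here; assign the result back (table = table.fillna(table.mean())) if the original should be updated.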

Okay, I am adding this as another answer because it isn't something I recommend at all. Pandas methods are vectorized and perform better, so loops are best avoided whenever possible.
However, here is a quick fix to your code:
import pandas as pd
import numpy as np
import math
table = pd.DataFrame({'A':[1,2,np.nan],
                      'B':[3,np.nan, np.nan],
                      'C':[4,5,6]})
def impute_missing_values(df):
    for column in df:
        # Series.items() replaces iteritems(), which was removed in pandas 2.0
        for idx, value in df[column].items():
            if math.isnan(value):
                df.loc[idx,column] = df[column].mean(skipna=True)
            else:
                pass
    return df
impute_missing_values(table)
table
Output:
     A    B  C
0  1.0  3.0  4
1  2.0  3.0  5
2  1.5  3.0  6

You can try SimpleImputer from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) with the mean strategy.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
table = pd.DataFrame({'A':[1,2,np.nan],
                      'B':[3,np.nan, np.nan],
                      'C':[4,5,6]})
print(table, '\n')
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns a plain array, so reuse the original column labels in their original order
table_means = pd.DataFrame(imp.fit_transform(table), columns=table.columns)
print(table_means)
The print commands result in:
     A    B  C
0  1.0  3.0  4
1  2.0  NaN  5
2  NaN  NaN  6
     A    B    C
0  1.0  3.0  4.0
1  2.0  3.0  5.0
2  1.5  3.0  6.0
To correct your code (as per my comment below):
def impute_missing_values(table):
    for column in table:
        table.loc[:,column] = np.where(table[column].isna(), table[column].mean(), table[column])
    return table

Related

Pandas pivot table with prefix to columns

I have a dataframe:
df = C1  A1.  A2.  A3.  Type
     A   1.   5.   2.   AG
     A   7.   3.   8.   SC
And I want to create:
df = C1  A1_AG  A1_SC  A2_AG  A2_SC
     A   1.     7.     5.     3
How can it be done?
You can instead use melt followed by a transpose:
(df.melt('Type')
   .assign(col=lambda d: d['Type']+'_'+d['variable'])
   .set_index('col')[['value']].T
)
Output:
col    AG_A1  SC_A1  AG_A2  SC_A2  AG_A3  SC_A3
value      1      7      5      3      2      8
With additional column(s):
(df.melt(['C1', 'Type'])
   .assign(col=lambda d: d['Type']+'_'+d['variable'])
   .pivot(index=['C1'], columns='col', values='value')
   .reset_index()
)
Output:
col C1  AG_A1  AG_A2  AG_A3  SC_A1  SC_A2  SC_A3
0    A      1      5      2      7      3      8
Use DataFrame.set_index with DataFrame.unstack:
df = df.set_index(['C1','Type']).unstack()
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index()
print (df)
  C1  A1_AG  A1_SC  A2_AG  A2_SC  A3_AG  A3_SC
0  A    1.0    7.0    5.0    3.0    2.0    8.0
One convenience option with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_wider(index = 'C1', names_from = 'Type')
  C1  A1_AG  A1_SC  A2_AG  A2_SC  A3_AG  A3_SC
0  A    1.0    7.0    5.0    3.0    2.0    8.0
Of course, you can skip the convenience function and use pivot directly:
result = df.pivot(index='C1', columns='Type')
result.columns = result.columns.map('_'.join)
result.reset_index()
  C1  A1_AG  A1_SC  A2_AG  A2_SC  A3_AG  A3_SC
0  A    1.0    7.0    5.0    3.0    2.0    8.0
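None of the answers above constructs the input frame, so here is a small self-contained sketch of the pivot approach. The sample frame is rebuilt from the table in the question; the float dtype of the value columns is an assumption, and the name wide is only illustrative.

```python
import pandas as pd

# Sample frame reconstructed from the question (dtypes are assumed).
df = pd.DataFrame({'C1': ['A', 'A'],
                   'A1': [1.0, 7.0],
                   'A2': [5.0, 3.0],
                   'A3': [2.0, 8.0],
                   'Type': ['AG', 'SC']})

# Without an explicit values argument, pivot spreads every remaining column,
# producing MultiIndex columns such as ('A1', 'AG').
wide = df.pivot(index='C1', columns='Type')

# Flatten each (value_column, type) pair into a single 'A1_AG'-style label.
wide.columns = ['_'.join(pair) for pair in wide.columns]
wide = wide.reset_index()
print(wide)
```

To get the Type_first naming used in the melt answer (AG_A1 and so on), reverse each pair before joining.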

Pandas / Numpy conditional calculation with NaN values

I'm dealing with incomplete data and would like to assign scoring to different rows.
For example:
Bluetooth and WLAN are not integers, but I would like to assign a value of 1 if data is available and 0 if there is no data (NaN).
Samsung's score would be 1 + 1 + 4 = 6
Nokia's score would be 0 + 0 + 5 = 5
         Bluetooth  WLAN   Rating  Score
Apple    Class-A    USB-A  NaN
Samsung  Class-B    USB-B  4
Nokia    NaN        NaN    5
I'm using Pandas at the moment but I'm not sure if Pandas alone is capable without Numpy.
Thanks a lot!
import pandas as pd
import numpy as np
data = {'Bluetooth': ['class-A', 'class-B', np.nan], 'WLAN': ['usb-A', 'usb-B', np.nan],'Rating': [np.nan, 4, 5]}
df = pd.DataFrame(data)
df = df.replace(np.nan, 0)
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(1)
df['score'] = df.sum(axis=1)
print(df.head())
Output:
   Bluetooth  WLAN  Rating  score
0        1.0   1.0     0.0    2.0
1        1.0   1.0     4.0    6.0
2        0.0   0.0     5.0    5.0
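Try this:
import pandas as pd
import numpy as np
# Count missing values only in the two feature columns, so a missing Rating is not subtracted from the score
df['Nan_count']=df[['Bluetooth','WLAN']].isnull().sum(axis=1)
df['score']=-df['Nan_count']+df['Rating'].replace(np.nan,0)+2
With this solution we do not need to replace the NaN values in the dataframe, and the computation is cheap as well.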
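On the question of whether pandas alone is enough: the scoring can be written with notna() and fillna() only. The following is a rough sketch under the assumption that the brand names form the index; NumPy appears only to create the NaN values in the sample data.

```python
import numpy as np
import pandas as pd

# Sample data from the question; using brand names as the index is an assumption.
df = pd.DataFrame({'Bluetooth': ['Class-A', 'Class-B', np.nan],
                   'WLAN': ['USB-A', 'USB-B', np.nan],
                   'Rating': [np.nan, 4, 5]},
                  index=['Apple', 'Samsung', 'Nokia'])

# notna() marks available cells as True; summing the booleans per row
# gives 1 point for each feature that has data.
presence = df[['Bluetooth', 'WLAN']].notna().sum(axis=1)

# A missing rating contributes 0 to the score.
df['Score'] = presence + df['Rating'].fillna(0)
print(df)
```

This leaves the original Bluetooth, WLAN and Rating columns untouched, which helps if the raw values are needed later.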

Quickly replace values in a Pandas DataFrame

I have the following dataframe:
df = pd.DataFrame(
    {
        'A':[1,2],
        'B':[3,4]
    }, index=['1','2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
#      A  B  Sum
# 1    1  3    4
# 2    2  4    6
# Sum  3  7   10
I want to:
replace 1 by 3*4/10
replace 2 by 3*6/10
replace 3 by 4*7/10
replace 4 by 7*6/10
What is the easiest way to do this? I want the solution to be able to extend to n number of rows and columns. Been cracking my head over this. TIA!
If I understood you correctly:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        'A':[1,2],
        'B':[3,4]
    }, index=['1','2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
# Caveat: the conditions match by value, so margin cells equal to 3 or 4 are rewritten too (see the output)
conditions = [(df==1), (df==2), (df==3), (df==4)]
values = [(3*4)/10, (3*6)/10, (4*7)/10, (7*6)/10]
df[df.columns] = np.select(conditions, values, df)
Output:
       A    B   Sum
1    1.2  2.8   4.2
2    1.8  4.2   6.0
Sum  2.8  7.0  10.0
Let us try building it from the original df, before you add the Sum row and column:
import numpy as np
v = np.multiply.outer(df.sum(axis=1).values,df.sum().values)/df.sum().sum()
out = pd.DataFrame(v,index=df.index,columns=df.columns)
out
Out[20]:
     A    B
1  1.2  2.8
2  1.8  4.2
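Since the question asks for something that extends to n rows and columns, here is a hedged sketch that wraps the outer-product idea in a helper and tries it on a larger frame. The function name expected_table and the 3x3 sample values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def expected_table(df):
    """Replace each cell with row_total * column_total / grand_total."""
    row_totals = df.sum(axis=1)
    col_totals = df.sum(axis=0)
    grand_total = df.to_numpy().sum()
    # np.outer builds the full rows-by-columns grid of products in one step.
    values = np.outer(row_totals, col_totals) / grand_total
    return pd.DataFrame(values, index=df.index, columns=df.columns)


# The 2x2 frame from the question, without the Sum row and column.
small = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['1', '2'])
small_out = expected_table(small)

# A hypothetical 3x3 frame to show that nothing depends on the shape.
big = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 10]},
                   index=['r1', 'r2', 'r3'])
big_out = expected_table(big)
print(small_out)
print(big_out)
```

A useful sanity check for this construction is that the row and column totals of the result equal those of the input.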

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and lost in the way to approach this problem: I have a dataframe where the information I need is mostly grouped in layers of 2,3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where the groups of rows are now a single row, where the information is unstacked in more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows and filter for each ID, unstacking the rows into a new dataframe. I have not managed to get far with the unstack or groupby functions. Is there an easy function or combination of functions that can do this?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe has information in 4 rows, but not always. Ending dataframe should have only one row per Id occurrence, with all the info in the columns.
So far, with help, I managed to run this code:
import csv
tmp = []
# path is the location of the input file, defined elsewhere
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)
# Create data table joining data with the same GAT value, GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0,len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp=(tmp[i][5],tmp[i][7],tmp[i][8],tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1
# for i in range(0,len(Data)):
#     print(Data[i])
with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power in handling the data. In addition, these open() calls have trouble with non-ASCII characters, which I have been unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With an unequal number of rows per ID
  ID  col1  col2
0  A   1.0   2.0
1  A   3.0   4.0
2  B   5.0   NaN
3  B   7.0   8.0
4  B   9.0  10.0
5  B   NaN  12.0
Code
import pandas as pd
import io
# read df (columns must be separated by at least two spaces to match the sep pattern)
df = pd.read_csv(io.StringIO("""
ID  col1  col2
A   1     2
A   3     4
B   5     nan
B   7     8
B   9     10
B   nan   12
"""), sep=r"\s{2,}", engine="python")
# solution
# number the rows within each ID: 0, 1, 2, ...
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
    col1_1  col2_1  col1_2  col2_2  col1_3  col2_3  col1_4  col2_4
ID
A      1.0     2.0     3.0     4.0     NaN     NaN     NaN     NaN
B      5.0     NaN     7.0     8.0     9.0    10.0     NaN    12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
      col1  col2
ID
A  0   1.0   2.0
   1   3.0   4.0
B  0   5.0   NaN
   1   7.0   8.0
   2   9.0  10.0
   3   NaN  12.0
where the multi-index is perfect for df.unstack().
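To connect this back to the semicolon-separated data in the question, here is a rough sketch that applies the same cumcount and unstack steps to three of the sample rows. The col prefix for the new column names and the choice to read everything as strings are assumptions made for illustration; for a real file, read_csv also accepts an encoding argument, which may help with the non-ASCII problem, though the right value depends on how the file was written.

```python
import io
import pandas as pd

# Three rows copied from the sample in the question.
raw = io.StringIO(
    '13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%\n'
    '15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%\n'
    '15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""\n'
)

# No header row in the data; keep every field as text, and keep "" as an empty string.
df = pd.read_csv(raw, sep=';', header=None, dtype=str, keep_default_na=False)
df = df.rename(columns={0: 'ID'})

# Number the rows within each ID, then move that counter into the columns.
g = df.groupby('ID').cumcount()
wide = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)

# Flatten (original column number, row number) into names like 'col3_2'.
wide.columns = [f'col{a}_{b + 1}' for a, b in wide.columns]
print(wide)
```

IDs with fewer rows than the largest group end up with missing values in the trailing columns, which can be dropped or filled afterwards.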

pandas groupby and agg operation of selected columns and row

I have a dataframe as below:
I am not sure if it is possible to use pandas to make an output as below:
difference=Response[df.Time=="pre"]-Response.min for each group
If pre is always the first row in each group and the value should be repeated on every row of the group:
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: x.iat[0] - x.min())
To show the value only on the first row of each group, you can replace the repeats with empty strings, but the column then mixes numbers and strings, which is likely to cause problems in later processing:
df['diff'] = df['diff'].mask(df['diff'].duplicated(), '')
EDIT:
df = pd.DataFrame({
    'Response':[2,5,0.4,2,1,4],
    'Time':[7,'pre',9,4,2,'pre'],
    'IDs':list('aaabbb')
})
#print (df)
d = df[df.Time=="pre"].set_index('IDs')['Response'].to_dict()
print (d)
{'a': 5.0, 'b': 4.0}
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: d[x.name] - x.min())
print (df)
   Response Time IDs  diff
0       2.0    7   a   4.6
1       5.0  pre   a   4.6
2       0.4    9   a   4.6
3       2.0    4   b   3.0
4       1.0    2   b   3.0
5       4.0  pre   b   3.0
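The same result can also be reached without a Python lambda, by mapping each row's ID to its pre value and subtracting the group minimum. This is only a sketch of an alternative, assuming exactly one pre row per ID; the names pre and group_min are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Response': [2, 5, 0.4, 2, 1, 4],
                   'Time': [7, 'pre', 9, 4, 2, 'pre'],
                   'IDs': list('aaabbb')})

# Response of the 'pre' row for each ID (assumes one 'pre' row per ID).
pre = df.loc[df['Time'] == 'pre'].set_index('IDs')['Response']

# Minimum Response within each group, broadcast back to every row.
group_min = df.groupby('IDs')['Response'].transform('min')

# Look up each row's 'pre' value by ID and subtract the group minimum.
df['diff'] = df['IDs'].map(pre) - group_min
print(df)
```

If an ID has no pre row, map returns NaN for that group, so the gap is visible rather than raising a KeyError.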