Assigning index column to empty pandas dataframe - pandas

I am creating an empty dataframe that I then want to add data to, one row at a time. I want to index on the first column, 'customer_ID'.
I have this:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'],index=['customer_ID'])
In[2]: df
Out[3]:
customer_ID a b c
customer_ID NaN NaN NaN NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?

The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'])
In[2]: df.set_index('customer_ID', drop=False, inplace=True)  # drop=False keeps customer_ID as a column, as in the output below
In[3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In[4]: id='x123'
In[5]: df.loc[id]=[id,4,5,6]
In[6]: df
Out[7]:
customer_ID a b c
x123 x123 4.0 5.0 6.0
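As a minimal sketch of why the index is set in a separate step, the snippet below compares the two constructions on an empty frame. The variable names are illustrative, and drop=False is an assumption made here so that customer_ID stays available as a regular column, matching the output shown above.

```python
import pandas as pd

cols = ['customer_ID', 'a', 'b', 'c']

# Passing index=['customer_ID'] creates one row labelled 'customer_ID', filled with NaN.
with_row = pd.DataFrame(columns=cols, index=['customer_ID'])
print(with_row.shape)

# Building the empty frame first and then setting the index adds no rows.
# drop=False (an assumption for this sketch) keeps customer_ID as a column too.
empty = pd.DataFrame(columns=cols).set_index('customer_ID', drop=False)
print(empty.shape)

# Rows can then be added one at a time by label.
empty.loc['x123'] = ['x123', 4, 5, 6]
print(empty)
```

With the default drop=True, customer_ID would leave the columns, so each new row would need three values instead of four.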

Yes, and you can call dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df

That is because your dataframe has no rows when you first create it.
df= pd.DataFrame({'customer_ID': ['2'],'a': ['1'],'b': ['A'],'c': ['1']})
df = df.set_index('customer_ID', drop=False)  # set_index returns a new frame, so assign it back
df

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
standardized_numerical_data,
encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
False!
As a newbie in Python, I wonder why this doesn't work? Also, is it better to use hierarchical indexing? Then I can extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, whereas any checks whether at least one value in the row is missing, which is what dropna() uses by default.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1    1    1
1  2  nan    2
2  3  nan  nan
model_data
   a  b  c
0  1  1  1

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula)

Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace all NaN values in a with the corresponding values in b**2, and make b NaN (shift NaN values and make some operations on them).
Desired result:
1 5
100 NaN
3 15
How is it possible with pandas?
You can get the rows you want to change using df['a'].isnull(). Then you can use that to update the columns with loc.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
df['b'] = df['b'].astype(float)  # NaN needs a float column
df.loc[change, 'a'] = df.loc[change, 'b'] ** 2
df.loc[change, 'b'] = np.nan
print(df)
Note that the change variable is only there to avoid repeating df['a'].isnull() in each assignment. You could inline that expression instead, but I think that looks cluttered.
Result:
a b
0 1.0 5.0
1 100.0 NaN
2 3.0 15.0
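The same result can also be sketched without loc, using fillna and mask. This is an alternative formulation rather than the approach described above. It relies on Series.fillna accepting another Series and on Series.mask replacing values with NaN where the condition is True; the name missing is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})

missing = df['a'].isnull()               # rows where a is NaN
df['a'] = df['a'].fillna(df['b'] ** 2)   # take b squared where a was missing
df['b'] = df['b'].mask(missing)          # set b to NaN on those same rows (b becomes float)
print(df)
```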

Fillna() depending on another column

I want to do the following:
Fill DF1 NaN with values from DF2 depending on column value in DF1.
Basically, DF1 has people with "income_type" and some NaN in "total_income". In DF2 there are "median income" for each "income_type". I want to fill NaN in "total_income" DF1 with median values from DF2
DF1, DF2
First, I would merge values from DF2 to DF1 by 'income_type'
DF3 = DF1.merge(DF2, how='left', on='income_type')
This way you have the values of median income and total income in the same dataframe.
After this, I would do a conditional assignment on the dataframe column, using a boolean mask in place of an if/else statement:
DF3.loc[DF3['total_income'].isna(), 'total_income'] = DF3['median income']
That will replace the NaN values with the median values from the merge
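A runnable sketch of those two steps follows. The question does not show its data, so the contents of DF1 and DF2 here are hypothetical stand-ins; only the column names come from the text above.

```python
import pandas as pd

# Hypothetical stand-ins for DF1 and DF2; the real data is not shown in the question.
DF1 = pd.DataFrame({
    'income_type': ['employee', 'retiree', 'employee', 'student'],
    'total_income': [1000.0, None, None, 300.0],
})
DF2 = pd.DataFrame({
    'income_type': ['employee', 'retiree', 'student'],
    'median income': [900.0, 500.0, 250.0],
})

# A left merge keeps every row of DF1 and attaches the matching median.
DF3 = DF1.merge(DF2, how='left', on='income_type')

# Overwrite total_income only where it is missing.
missing = DF3['total_income'].isna()
DF3.loc[missing, 'total_income'] = DF3.loc[missing, 'median income']
print(DF3)
```

Using how='left' means that a row whose income_type has no entry in DF2 is kept, with its total_income left as NaN, rather than dropped.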
You need to join the two dataframes and then replace the nan values with the median. Here is a similar working example. Cheers mate.
import pandas as pd
#create the example dataframes
df1 = pd.DataFrame({'income_type':['a','b','c','a','a','b','b'], 'total_income':[200, 300, 500,None,400,None,None]})
df2 = pd.DataFrame({'income_type':['a','b','c'], 'median_income':[205, 305, 505]})
# inner join the df1 with df2 on the column 'income_type'
joined = df1.merge(df2, on='income_type')
# fill the NaN values with the value from the column 'median_income' and save the result in a new column 'total_income_not_na'
joined['total_income_not_na'] = joined['total_income'].fillna(joined['median_income'])
print(joined)

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, I do the following (i.e. remove C first, normalise the data, then add the column back).
df_new = df.drop('C', axis=1)
df_concept = df[['C']]
from sklearn import preprocessing
x = df_new.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_concept
However, I am sure that there is an easier way of doing this in pandas (given the names of the columns that I do not need to normalise, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to pick out the numeric columns, apply min-max normalisation (subtract each column's minimum and divide by its range), and assign the result back to those columns only:
import numpy as np
df1 = df.select_dtypes(np.number)
df[df1.columns]=(df1-df1.min())/(df1.max()-df1.min())
print (df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
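As a minimal sketch of that pattern, the example below standardises the numeric columns instead of min-max scaling them. The helper zscore and the list columns are hypothetical names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 300, 500], 'C': list('abc')})

def zscore(col):
    # Hypothetical helper: centre a column and scale it by its sample standard deviation.
    return (col - col.mean()) / col.std()

columns = ['A', 'B']                      # columns to transform; C is left untouched
df[columns] = df[columns].apply(zscore)   # apply calls zscore once per column
print(df)
```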

Equivalent of R's which in pandas

How do I get the column of the min in the example below, not the actual number?
In R I would do:
which(min(abs(_quantiles - mean(_quantiles))))
In pandas I tried (did not work):
_quantiles.which(min(abs(_quantiles - mean(_quantiles))))
You could do it this way, call np.min on the df as a np array, use this to create a boolean mask and drop the columns that don't have at least a single non NaN value:
In [2]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df
Out[2]:
a b
0 -0.860548 -2.427571
1 0.136942 1.020901
2 -1.262078 -1.122940
3 -1.290127 -1.031050
4 1.227465 1.027870
In [15]:
df[df==np.min(df.values)].dropna(axis=1, thresh=1).columns
Out[15]:
Index(['b'], dtype='object')
idxmin and idxmax exist, but no general which as far as I can see.
(_quantiles - _quantiles.mean()).abs().idxmin()
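A small sketch of the idxmin approach, assuming the quantiles are held in a Series. The data and the names quantiles, distance and closest are hypothetical, since the question does not show _quantiles.

```python
import pandas as pd

# Hypothetical data standing in for _quantiles, which the question does not show.
quantiles = pd.Series({'q10': 1.0, 'q50': 4.0, 'q90': 10.0})

# Absolute distance of each entry from the mean of the Series.
distance = (quantiles - quantiles.mean()).abs()

# idxmin returns the label of the smallest distance, not the value itself.
closest = distance.idxmin()
print(closest)
```

Note that idxmin takes no data argument: the calculation is done first, and idxmin is then called on the result.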