Replace selected values of one column with median value of another column but with condition - pandas

So, I hope you know the famous Titanic question. This is what I did so far from the tutorial. Now I want to replace the NaN values in the Age column with the median of a subset of the Age column, where the subset is restricted to rows with a certain value of "Title".
For example, to fill NaN Ages where Title == "Mr", I want to use the median Age of all the rows that also have Title == "Mr".
I tried this:
for val in data["Title"].unique():
    median_age = data.loc[data.Title == val, "Age"].median()
    data.loc[data.Title == val, "Age"].fillna(median_age, inplace=True)
But Age still shows up as NaN. How can I do this?

Use combine_first to fill the NaN values. My dataset has no Title column, so I group by Sex instead, but the approach is the same:
df['Age'] = df['Age'].combine_first(df.groupby('Sex')['Age'].transform('median'))
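Applied to the question's Title column, the same pattern looks like this (toy data with hypothetical values):

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mrs", "Mr", "Mrs"],
    "Age":   [30.0, np.nan, 25.0, 40.0, np.nan],
})

# Median Age within each Title group, broadcast back to every row
group_median = data.groupby("Title")["Age"].transform("median")

# combine_first keeps existing Ages and fills the NaN slots from the group median
data["Age"] = data["Age"].combine_first(group_median)
print(data["Age"].tolist())  # [30.0, 35.0, 25.0, 40.0, 25.0]
```

Note that this assigns the result back to the column instead of relying on fillna(..., inplace=True) on a .loc slice, which operates on a copy and is why the original loop had no effect.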

Related

NaN output when multiplying row and column of dataframe in pandas

I have two data frames the first one looks like this:
and the second one like so:
I am trying to multiply the values in the "number of donors" column of the second data frame (96 values) with the values in the first row of the first data frame, columns 0-95 (also 96 values).
Below is the code I have for multiplying the two right now, but as you can see the values are all NaN:
Does anyone know how to fix this?
Your second dataframe has dtype object; you must convert it to float:
df_sls.iloc[0,3:-1].astype(float)
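A minimal sketch of the conversion, using made-up numbers rather than the asker's dataframes:

```python
import pandas as pd

# Values read in as strings produce an object-dtype Series
donors = pd.Series(["2", "3", "4"], dtype=object)
rates = pd.Series([10.0, 20.0, 30.0])

# Convert to float first, then multiply element-wise
result = rates * donors.astype(float)
print(result.tolist())  # [20.0, 60.0, 120.0]
```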

Is there a pandas function to get the variable names in a column?

I'm thinking of a hypothetical dataframe (df) with around 50 columns and 30000 rows, and one hypothetical column like e.g.: Toy = ['Ball','Doll','Horse',...,'Sheriff',etc].
Now I only have the name of the column (Toy) and I want to know what the distinct values inside the column are, without duplicates.
I'm thinking of an output like the .describe() function
df['Toy'].describe()
but with more info, because now I'm getting only this output
count 30904
unique 7
top "Doll"
freq 16562
Name: Toy, dtype: object
In other words, how do I get the 7 unique values in this column? I was thinking of something like copying the column and deleting the duplicated values, but I'm pretty sure there is a shorter way. Do you know the right code, or should I use another library?
Thank you so much!
You can use the unique() function to list all the unique values in a column. In your case, to list the unique values in the column named toys in the dataframe df, the syntax would look like
df["toys"].unique()
You can also use .drop_duplicates(), which returns a pandas Series:
df['toys'].drop_duplicates()
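A quick runnable comparison of the two on toy data (with nunique() as a bonus if you only need the count):

```python
import pandas as pd

# Hypothetical Toy column from the question
df = pd.DataFrame({"Toy": ["Ball", "Doll", "Ball", "Horse", "Doll"]})

print(df["Toy"].unique())           # numpy array, in order of first appearance
print(df["Toy"].drop_duplicates())  # pandas Series keeping the first occurrences
print(df["Toy"].nunique())          # 3
```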

Pandas get_dummies for a column of lists where a cell may have no value in that column

I have a column in a dataframe where all the values are lists (usually a one-item list per row), so I would like to use get_dummies to one-hot encode all the values. However, a few rows may have no value in that column. Originally it appears as a NaN, and I have replaced the NaN with an empty list, but in either case I do not see 0s and 1s in the get_dummies result; instead each generated column is blank (I would expect each generated column to be 0).
How do I get get_dummies to work with an empty list?
# create column from dict where value will be a list
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# line to replace nan in sponsor_list column with empty list
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
# use of get_dummies to encode the sponsor_list column
X = pd.concat([X, pd.get_dummies(X.sponsor_list.apply(pd.Series).stack()).sum(level=0)], axis=1)
Example:
111th-congress_senate-bill_3695.txt  False  ['Menendez,_Robert_[D-NJ].txt']
112th-congress_house-bill_3630.txt   False  []
111th-congress_senate-bill_852.txt   False  ['Vitter,_David_[R-LA].txt']
114th-congress_senate-bill_2832.txt  False  ['Isakson,_Johnny_[R-GA].txt']
107th-congress_senate-bill_535.txt   False  ['Bingaman,_Jeff_[D-NM].txt']
I want to one hot encode on the third column. That particular data item in the 2nd row has no person associated with it, so I need that row to be encoded with all 0s. The reason the third column needs to be a list is that I need to do this to a related column as well, where a row can have [0, n] values and n can be 5, 10, or even 20.
from sklearn.preprocessing import MultiLabelBinarizer

X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
X.loc[X['sponsor_list'].isnull(), 'sponsor_list'] = X.loc[X['sponsor_list'].isnull(), 'sponsor_list'].apply(lambda x: [])
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
I used a MultiLabelBinarizer to capture what I was trying to do. I still replace nan with empty list before applying, but then I fit_transform to create the 0/1 values which can result in no 1's in a row, or many 1's in a row.
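A self-contained sketch of the MultiLabelBinarizer approach on toy data (the column values here are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy version of the sponsor_list column, including an empty-list row
X = pd.DataFrame({"sponsor_list": [["A"], [], ["B"], ["A"]]})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(X.pop("sponsor_list")),
                       columns=mlb.classes_,
                       index=X.index)
X = X.join(encoded)
print(X)  # columns A and B; the empty-list row is all zeros
```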

Is there a way in pandas to compare two values within one column and sum how many times the second value is greater?

In my pandas dataframe I have one column, score, whose rows are values such as [80, 100], [90, 100], etc. What I want to do is go through this column and, if the second value in a list is greater than the first, count it, so that I end up with the number of rows where, in [a, b], b was greater. How would I do this?
print(len([x for x in df['score'] if x[1] > x[0]]))
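On toy data the one-liner behaves like this:

```python
import pandas as pd

df = pd.DataFrame({"score": [[80, 100], [90, 100], [100, 90], [50, 50]]})

# Count rows where the second list element is greater than the first
count = len([x for x in df["score"] if x[1] > x[0]])
print(count)  # 2
```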

How to change a value in a column based on whether or not a certain string combination is in other columns in the same row? (Pandas)

I am a very new newbie to Pandas and programming in general. I'm using Anaconda, if that matters.
I have the following on my hands:
The infamous Titanic survival dataset.
So, my idea was to search the dataframe and find the rows where the "Name" column contains the string "Mrs." AND at the same time "Age" is NaN (in which case the value in the "Age" column needs to be changed to 32). Also, when "Miss" is found in the cell, the values in two other columns are zeros.
My major problem is that I don't know how to tell Pandas to replace the value in the same row, or delete the whole row.
# I decided to collect the indexes of rows with the "Age" value == NaN to further use the
# indices to search through the "Names" column.
list_of_NaNs = df[df['Age'].isnull()].index.tolist()
for name in df.Name:
    if "Mrs." in name and name (list_of_NaNs):  # if the string combination "Mrs."
                                                # can be found within the cell...
        df.loc['Age'] = 32.5  # need to change the value in the column IN THE SAME ROW
    elif "Miss" in name and df.loc[Parch] > 0:  # how to make a reference to a value
                                                # IN THE SAME ROW???
        df.loc["Age"] = 5
    elif df.SibSp == 0 and Parch == 0:
        df.loc["Age"] = 32.5
    else:
        # mmm... how do I delete the entire row so that it doesn't
        # interfere with my future actions?
Here is how you can test whether 'Miss' or 'Mrs.' is present in the name column:
df.name.str.contains('Mrs')
So following will give you the rows where 'Mrs' is in name and Age is NaN
df[(df.name.str.contains('Mrs')) & (df.age.isna())]
You can play with different cases and tasks from here on.
Hope this helps :)
And to drop rows with NaN in age column:
df = df.drop(df[df.age.isna()].index)
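Combining both snippets, a sketch on hypothetical rows (lowercase column names, as in the answer's code):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Mrs. Smith", "Miss Jones", "Mr. Brown"],
    "age":  [np.nan, np.nan, 40.0],
})

# Set age to 32 for rows where the name contains "Mrs" and age is NaN;
# the boolean mask selects the matching rows, .loc writes in place
df.loc[df.name.str.contains("Mrs") & df.age.isna(), "age"] = 32

# Drop the remaining rows whose age is still NaN
df = df.drop(df[df.age.isna()].index)
print(df.age.tolist())  # [32.0, 40.0]
```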