How to change a value in a column based on whether or not a certain string combination is in other columns in the same row? (Pandas)

I am a very new newbie to Pandas and programming in general. I'm using Anaconda, if that matters.
I have the following on my hands:
The infamous Titanic survival dataset.
So, my idea was to search the dataframe and find the rows where the "Name" column contains the string "Mrs." AND, at the same time, "Age" is NaN (in which case the value in the "Age" column needs to be changed to 32). Also, when "Miss" is found in the cell, the values in two other columns are zeros.
My major problem is that I don't know how to tell Pandas to replace the value in the same row or delete the whole row.
# I decided to collect the indices of rows with the "Age" value == NaN,
# to further use the indices to search through the "Name" column.
list_of_NaNs = df[df['Age'].isnull()].index.tolist()
for name in df.Name:
    if "Mrs." in name and name (list_of_NaNs):  # if the string combination
        # "Mrs." can be found within the cell...
        df.loc['Age'] = 32.5  # need to change the value in the
        # column IN THE SAME ROW
    elif "Miss" in name and df.loc[Parch] > 0:  # how to make a
        # reference to a value IN THE SAME ROW???
        df.loc["Age"] = 5
    elif df.SibSp == 0 and Parch == 0:
        df.loc["Age"] = 32.5
    else:
        # mmm... how do I delete the entire row so that it doesn't
        # interfere with my future actions?

Here is how you can test whether 'Miss' or 'Mrs.' is present in the Name column:
df.Name.str.contains('Mrs')
So the following will give you the rows where 'Mrs' is in Name and Age is NaN:
df[(df.Name.str.contains('Mrs')) & (df.Age.isna())]
You can play with different cases and tasks from here on.
Hope this helps :)
And to drop rows with NaN in the Age column:
df = df.drop(df[df.Age.isna()].index)
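To change values in the same rows, pass a boolean mask like the ones above to df.loc together with the column you want to write. Here is a minimal sketch combining the pieces from the question (the fill values 32.5 and 5 and the Titanic column names Name, Age and Parch come from the question; the CSV path is hypothetical):
import pandas as pd

df = pd.read_csv('titanic.csv')  # hypothetical path to the dataset

# "Mrs" in Name and Age missing -> set Age in those same rows
mrs_mask = df.Name.str.contains('Mrs') & df.Age.isna()
df.loc[mrs_mask, 'Age'] = 32.5

# "Miss" in Name, Age missing, and Parch > 0 in the same row
miss_mask = df.Name.str.contains('Miss') & df.Age.isna() & (df.Parch > 0)
df.loc[miss_mask, 'Age'] = 5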

Related

Dataframe drop rows that do not contain specific values in the column, resulting in empty df

I want to drop rows in housing_plot_end that do not contain one of 5 values specified in the plot_test_data['housing price median'] dataframe object.
'Dł. geograficzna' is the name of the column; it translates to 'longitude' in English, but I left it as it was because maybe the space between these two words is causing a problem?
But I am receiving empty df with:
values_to_save = [plot_test_data['Dł. geograficzna']]
housing_plot_end = housing_plot_end[~housing_plot_end['Dł. geograficzna'].isin(values_to_save) == False]
The column in plot_test_data contains 5 numerical values, thus 5 rows:
-121.46
-117.23
-119.04
-117.13
-118.7
Meanwhile housing_plot_end has tens of thousands of rows and I need to drop every row which does not contain one of these specific values in the column of housing_plot_end['Dł. geograficzna']
But I am receiving an empty dataframe object when I run this code.
I don't know what to do.
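For what it's worth, the likely culprit is the extra list around the column: values_to_save = [plot_test_data['Dł. geograficzna']] is a one-element list whose single element is a whole Series, so isin never matches any scalar and every row is dropped (the ~ ... == False double negation merely cancels itself out and is best removed). A sketch of the usual fix, assuming the same frames as above:
# Pass the Series itself (or its values), not a list wrapping it
values_to_save = plot_test_data['Dł. geograficzna']

# Keep only the rows whose value appears among the five reference values
housing_plot_end = housing_plot_end[
    housing_plot_end['Dł. geograficzna'].isin(values_to_save)
]
Note that isin compares floats for exact equality, so the reference values must match the column values to the last digit.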

Replace selected values of one column with the median value of another column, but with a condition

So, I hope you know the famous Titanic question. This is what I have done so far by following the tutorial. Now I want to replace NaN values in the Age column with the median of part of the Age column, where the selected part has a certain value for "Title".
For example, I want to replace NaN in Age where Title=="Mr", so the median Age for "Mr" would be filled into the missing places which also have Title=="Mr".
I tried this:
for val in data["Title"].unique():
    median_age = data.loc[data.Title == val, "Age"].median()
    data.loc[data.Title == val, "Age"].fillna(median_age, inplace=True)
But still Age shows up as NaN. How can I do this?
Use combine_first to fill the NaNs. My dataset has no Title column, but the idea is the same:
df['Age'] = df['Age'].combine_first(df.groupby('Sex')['Age'].transform('median'))
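If your frame does have the Title column from the question, the same grouped-median idea can be written with fillna plus transform (a sketch, assuming Title and Age exist as described):
# transform('median') broadcasts each group's median back to its rows,
# and fillna only touches the rows where Age is NaN
data['Age'] = data['Age'].fillna(
    data.groupby('Title')['Age'].transform('median')
)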

Pandas get_dummies for a column of lists where a cell may have no value in that column

I have a column in a dataframe where all the values are lists (usually a list of one item for each row), and I would like to use get_dummies to one-hot encode all the values. However, there may be a few rows with no value for that column. I originally saw it as a NaN, and I have since replaced that NaN with an empty list, but in either case I do not see 0s and 1s in the result of get_dummies; rather, each generated column is blank (I would expect each generated column to be 0).
How do I get get_dummies to work with an empty list?
# create column from dict where value will be a list
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# line to replace nan in sponsor_list column with empty list
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
# use of get_dummies to encode the sponsor_list column
X = pd.concat([X, pd.get_dummies(X.sponsor_list.apply(pd.Series).stack()).sum(level=0)], axis=1)
Example:
111th-congress_senate-bill_3695.txt  False  ['Menendez,_Robert_[D-NJ].txt']
112th-congress_house-bill_3630.txt   False  []
111th-congress_senate-bill_852.txt   False  ['Vitter,_David_[R-LA].txt']
114th-congress_senate-bill_2832.txt  False  ['Isakson,_Johnny_[R-GA].txt']
107th-congress_senate-bill_535.txt   False  ['Bingaman,_Jeff_[D-NM].txt']
I want to one-hot encode on the third column. The particular data item in the 2nd row has no person associated with it, so I need that row to be encoded with all 0s. The reason I need the third column to be a list is that I need to do this to a related column as well, where each list can have [0, n] values and n can be 5, 10, or even 20.
from sklearn.preprocessing import MultiLabelBinarizer

X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# replace NaN in sponsor_list column with an empty list
X.loc[X['sponsor_list'].isnull(), ['sponsor_list']] = X.loc[X['sponsor_list'].isnull(), 'sponsor_list'].apply(lambda x: [])
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
I used a MultiLabelBinarizer to capture what I was trying to do. I still replace NaN with an empty list before applying it, but then fit_transform creates the 0/1 values, which can result in no 1s in a row, or many 1s in a row.
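A self-contained toy demonstration of the same pattern (the data and column name are made up for illustration):
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

X = pd.DataFrame({'sponsor_list': [['a'], [], ['b'], ['a', 'b']]})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                       columns=mlb.classes_,
                       index=X.index)
print(encoded)
#    a  b
# 0  1  0
# 1  0  0   <- the empty list becomes an all-zero row
# 2  0  1
# 3  1  1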

Pandas series replace value ignoring case but only if exact match

As the title says, I'm looking for a solution to replace an exact string in a series, ignoring case.
ls = {'CAT':'abc','DOG' : 'def','POT':'ety'}
d = pd.DataFrame({'Data': ['cat','dog','pot','Truncate','HotDog','ShuPot'],'Result':['abc','def','ety','Truncate','HotDog','ShuPot']})
d
In the above code, the dictionary ls holds the key-value pairs, where the key is the existing value in a dataframe column and the value is the value to replace it with.
The issue in this case is that the service that passes the dictionary always holds the dictionary keys in upper case, while the dataframe might have the values in lowercase.
The expected output is stored in the 'Result' column.
I tried including re.ignore = True, which changes the last 2 values. I also tried the following code, but it is not working as expected; it converts values to upper case from the previous iteration.
for k, v in ls.items():
    print(k, v)
    d['Data'] = d['Data'].astype(str).str.upper().replace({k: v})
    print(d)
I'd appreciate any help.
Create a mapping series from the given dictionary, then transform the index of the mapping series to lower case. Then use Series.map to map the values in the Data column to the values in the mappings, and use Series.fillna to fill the missing values in the mapped series:
mappings = pd.Series(ls)
mappings.index = mappings.index.str.lower()
d['Result'] = d['Data'].str.lower().map(mappings).fillna(d['Data'])
# print(d)
Data Result
0 cat abc
1 dog def
2 pot ety
3 Truncate Truncate
4 HotDog HotDog
5 ShuPot ShuPot

How to duplicate a row in pandas based on a column condition?

I have a pandas data frame and I would like to duplicate those rows which meet some column condition (i.e. having multiple elements in the CourseID column).
I tried iterating over the data frame to identify the rows which should be duplicated, but I don't know how to duplicate them.
Using Pandas version 0.25 it is quite easy.
The first step is to split df.CourseID (converting each element to a list) and then to explode it (breaking each list into multiple rows, repeating the other columns in each row):
course = df.CourseID.str.split(',').explode()
The result is:
0 456
1 456
1 799
2 789
Name: CourseID, dtype: object
Then, all you have to do is join df with course; but in order to avoid repeating column names, you have to drop the original CourseID column first. Fortunately, it can be expressed in a single instruction:
df.drop(columns=['CourseID']).join(course)
If you have an older version of Pandas, this is a good reason to upgrade it.
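For completeness, a minimal end-to-end sketch with made-up data (every name other than CourseID is invented for illustration):
import pandas as pd

# Toy frame: CourseID holds comma-separated course ids
df = pd.DataFrame({'Student': ['Ann', 'Bob', 'Cid'],
                   'CourseID': ['456', '456,799', '789']})

# One row per course, with the other columns repeated
course = df.CourseID.str.split(',').explode()
result = df.drop(columns=['CourseID']).join(course)
print(result)
#   Student CourseID
# 0     Ann      456
# 1     Bob      456
# 1     Bob      799
# 2     Cid      789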