Pandas get_dummies for a column of lists where a cell may have no value in that column

I have a column in a dataframe where every value is a list (usually a single-item list for each row), so I would like to use get_dummies to one-hot encode all the values. However, there may be a few rows where there is no value for the column. I originally saw those as NaN and then replaced the NaN with an empty list, but in either case I do not see 0s and 1s in the get_dummies result; instead each generated column is blank for those rows (I would expect each generated column to be 0).
How do I get get_dummies to work with an empty list?
# create column from dict where value will be a list
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# line to replace nan in sponsor_list column with empty list
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
# use of get_dummies to encode the sponsor_list column
X = pd.concat([X, pd.get_dummies(X.sponsor_list.apply(pd.Series).stack()).sum(level=0)], axis=1)
Example:
111th-congress_senate-bill_3695.txt False ['Menendez,_Robert_[D-NJ].txt']
112th-congress_house-bill_3630.txt False []
111th-congress_senate-bill_852.txt False ['Vitter,_David_[R-LA].txt']
114th-congress_senate-bill_2832.txt False ['Isakson,_Johnny_[R-GA].txt']
107th-congress_senate-bill_535.txt False ['Bingaman,_Jeff_[D-NM].txt']
I want to one-hot encode on the third column. The data item in the 2nd row has no person associated with it, so I need that row to be encoded with all 0s. The reason the third column needs to be a list is that I need to do the same thing to a related column, where a row can have anywhere from 0 to n values and n can be 5 or 10 or even 20.

from sklearn.preprocessing import MultiLabelBinarizer

X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
X.loc[X['sponsor_list'].isnull(), ['sponsor_list']] = X.loc[X['sponsor_list'].isnull(), 'sponsor_list'].apply(lambda x: [])
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
I used a MultiLabelBinarizer to capture what I was trying to do. I still replace NaN with an empty list before applying it, but then fit_transform creates the 0/1 values, which can result in no 1s in a row, or many 1s in a row.
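For illustration, here is a minimal, self-contained sketch of the same pattern on toy data (the column name and sponsor strings are just examples); the empty list in the second row comes out as all zeros:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

toy = pd.DataFrame({'sponsor_list': [['Menendez,_Robert_[D-NJ].txt'],
                                     [],
                                     ['Vitter,_David_[R-LA].txt']]})
mlb = MultiLabelBinarizer()
# pop removes the list column; the encoded 0/1 columns are joined back on the same index
toy = toy.join(pd.DataFrame(mlb.fit_transform(toy.pop('sponsor_list')),
                            columns=mlb.classes_,
                            index=toy.index))
print(toy)
#    Menendez,_Robert_[D-NJ].txt  Vitter,_David_[R-LA].txt
# 0                            1                         0
# 1                            0                         0
# 2                            0                         1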

Related

pandas df: replace values with np.NaN if character count does not match across multiple columns

currently stuck with something I hope to find an answer for in this forum:
I have a df with multiple columns containing URLs. My index consists of URLs as well.
AIM: I'd like to replace df values across all columns with np.NaN if the number of "/" (count()) in the index is not equal to the number of "/" (count()) in the value of each of the other columns.
First, you need one column to compare to.
counts = df['id_url'].str.count('/')
Then you count the "/" in every column at once and compare against it (a DataFrame has no .str accessor, so the count is done per column):
mask = df.apply(lambda col: col.str.count('/')).eq(counts, axis=0)
Then we keep only the rows where all the values are equal.
mask = mask.all(axis=1)
Now that we have a mask marking the rows where every value matches, we can invert it with ~ to select the rows where at least one column does not match.
df.loc[~mask, :] = np.nan # replaces every value in the row with np.nan
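A quick sketch of the whole recipe on toy data, assuming id_url is the reference column (the frame and column names are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id_url':    ['a/b/c', 'x/y'],
                   'other_url': ['d/e/f', 'x/y/z']})
counts = df['id_url'].str.count('/')   # slashes per row in the reference column
mask = df.apply(lambda col: col.str.count('/')).eq(counts, axis=0).all(axis=1)
df.loc[~mask, :] = np.nan              # second row is wiped because its slash counts differ
print(df)
#   id_url other_url
# 0  a/b/c     d/e/f
# 1    NaN       NaN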

Parse dictionary inside dataframe

One column of my df has either 1. a nested dictionary or 2. NaN as its value.
Each dict has 2 key-value pairs like this one:
{'value': '1', 'info': {....}}
I only wish to get the value of "value"; the value of "info" is not useful, and we can leave NaN where the value is NaN.
What is the easiest way to achieve this?
BTW I tried df_september_p1['that_column_name'] == np.nan
and df_september_p1['that_column_name'] == 'nan',
which yield the same Boolean values. The weird thing is that I can see the 2nd row has NaN as its value, but the result is False for the 2nd row… I don't get why.
You can use Series.str.get, which works well with dictionaries as well as with missing NaN values:
df_september_p1['val'] = df_september_p1['that_column_name'].str.get('value')
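A quick sketch on toy data to show the behaviour (the NaN row stays NaN):
import numpy as np
import pandas as pd

s = pd.Series([{'value': '1', 'info': {'a': 1}}, np.nan, {'value': '7', 'info': {}}])
print(s.str.get('value'))
# 0      1
# 1    NaN
# 2      7
# dtype: object
As an aside, the comparison == np.nan is always False because NaN never compares equal to anything, not even itself; use Series.isna() to test for missing values.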

Pandas series replace value ignoring case but only if exact match

As the title says, I'm looking for a perfect solution to replace exact string matches in a series while ignoring case.
ls = {'CAT':'abc','DOG' : 'def','POT':'ety'}
d = pd.DataFrame({'Data': ['cat','dog','pot','Truncate','HotDog','ShuPot'],'Result':['abc','def','ety','Truncate','HotDog','ShuPot']})
d
In the above code, ls holds the key-value pairs, where the key is the existing value in a dataframe column and the value is the value to replace it with.
The issue with this case is that the service passing the dictionary always holds the dictionary keys in upper case, whereas the dataframe might have the values in lower case.
The expected output is stored in the 'Result' column.
I tried including re.ignore = True, which changes the last 2 values. I also tried the following code, but it is not working as expected; it keeps values converted to upper case from the previous iteration.
for k, v in ls.items():
    print(k, v)
    d['Data'] = d['Data'].astype(str).str.upper().replace({k: v})
    print(d)
I'd appreciate any help.
Create a mapping series from the given dictionary and transform its index to lower case. Then use Series.map to map the lower-cased values in the Data column to the values in mappings, and use Series.fillna to fill the values that did not match from the original Data column:
mappings = pd.Series(ls)
mappings.index = mappings.index.str.lower()
d['Result'] = d['Data'].str.lower().map(mappings).fillna(d['Data'])
# print(d)
Data Result
0 cat abc
1 dog def
2 pot ety
3 Truncate Truncate
4 HotDog HotDog
5 ShuPot ShuPot

Fast remove element of list if contained by pandas dataframe

I have a list of strings, and two separate pandas dataframes. One of the dataframes contains NaNs. I am trying to find a fast way of checking if any item in the list is contained in either of the dataframes, and if so, to remove it from the list.
Currently, I do this with list comprehension. I first concatenate the two dataframes. I then loop through the list, and using an if statement check if it is contained in the concatenated dataframe values.
patches = [patch for patch in patches if not patch in bad_patches.values]
The first 5 elements of my list of strings:
patches[1:5]
['S2A_MSIL2A_20170613T101031_11_52',
'S2A_MSIL2A_20170717T113321_35_89',
'S2A_MSIL2A_20170613T101031_12_39',
'S2A_MSIL2A_20170613T101031_11_77']
An example of one of my dataframes; the second one is the same but contains fewer rows. Note the first row contains patches[2].
cloud_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89
1 S2A_MSIL2A_20170717T113321_39_84
2 S2B_MSIL2A_20171112T114339_0_13
3 S2B_MSIL2A_20171112T114339_0_52
4 S2B_MSIL2A_20171112T114339_0_53
The concatenated dataframe:
bad_patches = pd.concat([cloud_patches, snow_patches], axis=1)
bad_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89 S2B_MSIL2A_20170831T095029_27_76
1 S2A_MSIL2A_20170717T113321_39_84 S2B_MSIL2A_20170831T095029_27_85
2 S2B_MSIL2A_20171112T114339_0_13 S2B_MSIL2A_20170831T095029_29_75
3 S2B_MSIL2A_20171112T114339_0_52 S2B_MSIL2A_20170831T095029_30_75
4 S2B_MSIL2A_20171112T114339_0_53 S2B_MSIL2A_20170831T095029_30_78
and the tail, showing the NaNs of one column:
bad_patches.tail()
61702 NaN S2A_MSIL2A_20180228T101021_43_6
61703 NaN S2A_MSIL2A_20180228T101021_43_8
61704 NaN S2A_MSIL2A_20180228T101021_43_11
61705 NaN S2A_MSIL2A_20180228T101021_43_13
61706 NaN S2A_MSIL2A_20180228T101021_43_16
Column headers are all (poorly) named 0.
The second element of patches should be removed as it's contained in the first row of bad_patches. My method does work but takes absolutely ages. bad_patches has 60,000 rows and the length of patches is variable. Right now, for 1,000 patches, it takes about 2.04 seconds, but I need to scale up to 500k patches, so I am hoping there is a faster way. Thanks!
I would create a set with the values from cloud_patches and snow_patches. Then also create a set of patches:
patch_set = set(cloud_patches[0]).union(set(snow_patches[0]))
patches = set(patches)
Now you just subtract all values in patch_set from the values in patches, and you will be left with only the values in patches that show up in neither cloud_patches nor snow_patches:
cleaned_list = list(patches - patch_set)
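A quick sketch on toy data showing the idea (the frames and patch names are just examples):
import pandas as pd

patches = ['S2A_MSIL2A_20170613T101031_11_52',
           'S2A_MSIL2A_20170717T113321_35_89',
           'S2A_MSIL2A_20170613T101031_12_39']
cloud_patches = pd.DataFrame({0: ['S2A_MSIL2A_20170717T113321_35_89']})
snow_patches = pd.DataFrame({0: ['S2B_MSIL2A_20170831T095029_27_76']})

patch_set = set(cloud_patches[0]).union(set(snow_patches[0]))
cleaned_list = list(set(patches) - patch_set)   # the second patch is removed
One thing to keep in mind: converting patches to a set loses the original order and drops duplicates. If that matters, keep the list comprehension but test membership against patch_set instead of bad_patches.values, since set lookups are O(1).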

How to change a value in a column based on whether or not a certain string combination is in other columns in the same row? (Pandas)

I am a very new newbie to Pandas and programming in general. I'm using Anaconda, if that matters.
I have the following on my hands:
The infamous Titanic survival dataset.
So, my idea was to search the dataframe and find the rows where the "Name" column contains the string "Mrs." AND at the same time the "Age" is NaN (in which case the value in the "Age" column needs to be changed to 32). Also, when "Miss" is found in the cell, the values in two other columns are zeros.
My major problem is that I don't know how to tell Pandas to replace the value in the same row or delete the whole row.
#I decided to collect the indexes of rows with the "Age" value == NaN to further use the
#indices to search through the "Names column."
list_of_NaNs = df[df['Age'].isnull()].index.tolist()
for name in df.Name:
    if "Mrs." in name and name (list_of_NaNs): #if the string combination "Mrs." can be found within the cell...
        df.loc['Age'] = 32.5 #need to change the value in the column IN THE SAME ROW
    elif "Miss" in name and df.loc[Parch]>0: #how to make a reference to a value IN THE SAME ROW???
        df.loc["Age"] = 5
    elif df.SibSp ==0 and Parch ==0:
        df.loc["Age"] = 32.5
    else:
        #mmm... how do I delete entire row so that it doesn't interfere with my future actions?
Here is how you can test if 'Miss' or 'Mrs.' is present in the Name column:
df.Name.str.contains('Mrs')
So the following will give you the rows where 'Mrs' is in Name and Age is NaN:
df[(df.Name.str.contains('Mrs')) & (df.Age.isna())]
You can play with different cases and tasks from here on.
Hope this helps :)
And to drop rows with NaN in the Age column:
df = df.drop(df[df.Age.isna()].index)
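Putting those pieces together, here is one possible sketch of the updates the question describes, written with boolean masks instead of a loop (the column names follow the Titanic dataset and the fill values follow the question's code; this is just one way to express it, not necessarily the asker's final solution):
# fill Age where "Mrs" appears in Name and Age is missing
mrs_na = df.Name.str.contains('Mrs') & df.Age.isna()
df.loc[mrs_na, 'Age'] = 32.5

# fill Age where "Miss" appears in Name, Age is missing and Parch > 0
miss_na = df.Name.str.contains('Miss') & df.Age.isna() & (df.Parch > 0)
df.loc[miss_na, 'Age'] = 5

# fill Age for passengers travelling alone (no siblings/spouses, no parents/children)
alone = df.Age.isna() & (df.SibSp == 0) & (df.Parch == 0)
df.loc[alone, 'Age'] = 32.5

# any rows still missing Age can then be dropped
df = df.dropna(subset=['Age'])
With .loc, the boolean mask selects the rows and the second argument selects the column, so each assignment lands in the same rows that the mask picked out.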