Parse dictionary inside dataframe - pandas

One column of my df holds either (1) a nested dictionary or (2) NaN.
Each dict has two key-value pairs, like this one:
{'value': '1', 'info': {....}}
I only want the value of 'value'; the value of 'info' is not useful. Rows that are NaN can stay NaN.
What is the easiest way to achieve this?
BTW, I tried df_september_p1['that_column_name']==np.nan
and df_september_p1['that_column_name']=='nan',
which yield the same Boolean values. The weird thing is that the 2nd row shows NaN as its value, yet the result is False for that row, and I don't get why.

You can use Series.str.get, which works with dictionaries and also with missing values (NaN):
df_september_p1['val'] = df_september_p1['that_column_name'].str.get('value')
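A minimal sketch of the same idea on a small hypothetical frame (the data below is made up for illustration). It also shows why the comparison in the question returns False everywhere: NaN never compares equal to anything, including itself, so isna() is the usual way to find missing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_september_p1: one dict row and one NaN row.
df = pd.DataFrame({
    "that_column_name": [{"value": "1", "info": {"k": "v"}}, np.nan],
})

# .str.get looks up the key in each dict and leaves missing rows as NaN.
df["val"] = df["that_column_name"].str.get("value")

# NaN never compares equal to anything, so == np.nan is False on every row.
eq_mask = df["that_column_name"] == np.nan

# isna() is the reliable way to flag the missing rows.
na_mask = df["that_column_name"].isna()

print(df["val"])
print(na_mask)
```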

Related

What's the best way to insert columns in a pandas Dataframe when you don't know the exact number of columns?

I have an input dataframe.
I have also a list, with the same len as the number of rows in the dataframe.
Every element of the list is a dictionary: the key is the name of the new column, and the value is the value to be inserted in the cell.
I have to insert the columns from that list in the dataframe.
What is the best way to do so?
So far, given the input dataframe indf and the list l, I came up with something along the lines of:
from copy import deepcopy
outdf = deepcopy(indf)
for index, row in indf.iterrows():
    e = l[index]
    # iterate over the (key, value) pairs of this row's dictionary
    for key, value in e.items():
        outdf.loc[index, key] = value
But it doesn't seem Pythonic or idiomatic pandas, and I get performance warnings like:
<ipython-input-5-9dde586a9c14>:8: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
If the list is in the same order as the rows of the data frame, you can convert your list of dictionaries to a data frame:
import pandas as pd

mylist = [
    {'a':1,'b':2,'c':3},
    {'e':11,'f':22,'c':33},
    {'a':111,'b':222,'c':333}
]
mylist_df = pd.DataFrame(mylist)
     a    b    c    e    f
0    1    2    3  nan  nan
1  nan  nan   33   11   22
2  111  222  333  nan  nan
Then you can use pd.concat to join the new columns to your input data frame:
result = pd.concat([input_df, mylist_df], axis=1)
This way, a column is created for every unique key across the dictionaries, regardless of whether a key appears in only some of them.
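One caveat worth sketching: pd.concat with axis=1 aligns rows on the index, not on position. If the input frame does not have a default 0..n-1 index, the new frame can be given the same index first. The frame and list below are hypothetical; indf and l are the names used in the question.

```python
import pandas as pd

# Hypothetical input frame with a non-default index.
indf = pd.DataFrame({"id": [10, 20, 30]}, index=["r0", "r1", "r2"])

# One dictionary per row, in the same order as the rows of indf.
l = [{"a": 1, "b": 2}, {"c": 33}, {"a": 111}]

# Reuse indf's index so that concat lines the rows up correctly.
new_cols = pd.DataFrame(l, index=indf.index)

# All new columns are added in a single step, which avoids the fragmentation warning.
outdf = pd.concat([indf, new_cols], axis=1)
print(outdf)
```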

Row wise average return nan

Here is my data frame
Where I wrote 1 or 2, I would like to get the mean/median of the previous column.
For instance, for DXC.N, the expected output where I wrote 1 is mean(nan, -0.44.., 0.1127.., -0.15.., -0.19.., nan).
For EFX, the expected output where I wrote 2 is mean(nan, -0.14.., 0.06.., 0.13.., 0.007, nan).
I tried the following, but it returns only NaNs:
DF['Column8']=DF.groupby('Column1')['Column8'].mean()
Thanks,
I think you need something more like this.
You want the mean of Column7, not Column8, right?
# use transform(): it returns a Series with the same length and index as the original dataframe
DF['Column8']=DF.groupby('Column1')['Column7'].transform('mean')
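A small sketch of why the original attempt produced only NaNs, using made-up numbers: groupby(...).mean() returns one value per group, indexed by the group key, so assigning it to a column aligns on labels that do not exist in the frame's row index. transform('mean') broadcasts the group mean back to every row instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names follow the question.
DF = pd.DataFrame({
    "Column1": ["DXC.N", "DXC.N", "DXC.N", "EFX", "EFX"],
    "Column7": [np.nan, -0.4, 0.1, 0.2, np.nan],
})

# One mean per group, indexed by 'DXC.N' and 'EFX' rather than by 0..4.
per_group = DF.groupby("Column1")["Column7"].mean()

# Assigning it aligns on the index, finds no matching labels, and yields NaN.
misaligned = DF.assign(bad=per_group)["bad"]

# transform keeps the original index, so every row receives its group's mean.
DF["Column8"] = DF.groupby("Column1")["Column7"].transform("mean")
print(DF)
```

Passing 'median' instead of 'mean' to transform works the same way.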

Pandas get_dummies for a column of lists where a cell may have no value in that column

I have a column in a dataframe where all the values are lists (list of one item usually for each row). So, I would like to use get_dummies to one hot encode all the values. However, there may be a few rows where there is not a value for the column. I have seen it originally as a nan and then I have replaced that nan with an empty list, but in either case I do not see 0 and 1s for the result for the get_dummies, but rather each generated column is blank (I would expect each generated column to be 0).
How do I get get_dummies to work with an empty list?
# create column from dict where value will be a list
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# line to replace nan in sponsor_list column with empty list
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
# use of get_dummies to encode the sponsor_list column
X = pd.concat([X, pd.get_dummies(X.sponsor_list.apply(pd.Series).stack()).sum(level=0)], axis=1)
Example:
111th-congress_senate-bill_3695.txt False ['Menendez,_Robert_[D-NJ].txt']
112th-congress_house-bill_3630.txt False []
111th-congress_senate-bill_852.txt False ['Vitter,_David_[R-LA].txt']
114th-congress_senate-bill_2832.txt False ['Isakson,_Johnny_[R-GA].txt']
107th-congress_senate-bill_535.txt False ['Bingaman,_Jeff_[D-NM].txt']
I want to one hot encode on the third column. The data item in the 2nd row has no person associated with it, so I need that row to be encoded with all 0s. The reason I need the third column to be a list is that I need to do the same for a related column, where each row can hold anywhere from 0 to n values, and n can be 5, 10, or even 20.
from sklearn.preprocessing import MultiLabelBinarizer

X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
I used a MultiLabelBinarizer to capture what I was trying to do. I still replace nan with empty list before applying, but then I fit_transform to create the 0/1 values which can result in no 1's in a row, or many 1's in a row.
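For comparison, here is a pandas-only sketch that stays closer to the get_dummies attempt in the question. This is an alternative, not the approach above: it assumes a pandas version that has Series.explode, and the sponsor names are invented. Both empty lists and NaN explode to NaN, which get_dummies ignores by default, so those rows come out as all zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical column of lists, including an empty list and a missing value.
sponsors = pd.Series([["Menendez"], [], np.nan, ["Vitter", "Menendez"]])

# One row per list element; empty lists and NaN both become a single NaN row.
exploded = sponsors.explode()

# One indicator column per name, then collapse back to one row per original index.
encoded = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()
print(encoded)
```

The result can then be joined back to the original frame on the index, as in the MultiLabelBinarizer version.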

Pandas DataFrame: sort_values by an index with empty strings

I have a pandas DataFrame with a multi-level index. I want to sort by one of the index levels. It has float values, but occasionally a few empty strings too, which I want to be treated as nan.
df = pd.DataFrame(dict(x=[1,2,3,4]), index=[1,2,3,''])
df.index.name = 'i'
df.sort_values('i')
TypeError: '<' not supported between instances of 'str' and 'int'
One way to solve the problem is to replace the empty strings with nan, do the sort, and then replace nan with empty strings again.
I am wondering if there is any way to tweak sort_values so that it treats empty strings as nan.
Why are there empty strings in the first place?
In my application, the data that is read actually has missing values, which are read as np.nan. But np.nan values cause problems with groupby, so they are replaced with empty strings. I wish we had a constant like nan that is treated like an empty string by groupby and like nan by numeric operations.
In pandas, missing values are not empty strings; they are only written out as empty strings when a DataFrame with missing values is saved.
Btw, the main problem is the mix of numeric values and strings (the empty strings); it is best to convert everything to numeric to avoid it.
You can replace the empty strings with missing values using rename:
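import numpy as np
import pandas as pd

df = pd.DataFrame(dict(x=[1,2,3,4]), index=[1,2,3,''])
df.index.name = 'i'
# map the empty-string index label to NaN so the index becomes numeric
df = df.rename({'':np.nan})
df = df.sort_values('i')
print (df)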
     x
i
1.0  1
2.0  2
3.0  3
NaN  4
If the original data cannot be changed, a possible solution is to get the positions of the sorted values with Index.argsort and reorder the rows with DataFrame.iloc:
df = df.iloc[df.rename({'':np.nan}).index.argsort()]
print (df)
   x
i
1  1
2  2
3  3
   4
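A variation on the same argsort idea, sketched under the assumption that any non-numeric label (not just the empty string) should sort last: pd.to_numeric with errors='coerce' turns such labels into NaN, and NumPy's argsort places NaN at the end. The frame below is hypothetical and deliberately unsorted.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an empty string mixed into a numeric index.
df = pd.DataFrame({"x": [1, 2, 3, 4]}, index=[3, "", 1, 2])
df.index.name = "i"

# Coerce the labels to floats; anything that is not a number becomes NaN.
numeric_index = np.asarray(pd.to_numeric(df.index, errors="coerce"), dtype=float)

# argsort puts NaN last; a stable sort keeps ties in their original order.
order = np.argsort(numeric_index, kind="stable")

# Reorder by position, so the original labels (including '') are kept as they are.
sorted_df = df.iloc[order]
print(sorted_df)
```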

Convert floats to ints in pandas dataframe

I have a pandas dataframe with a column ‘distance’ and it is of datatype ‘float64’.
Distance
14.827379
0.754254
0.2284546
1.833768
I want to convert these numbers to whole numbers (14,0,0,1). I tried with this but I get the error “ValueError: Cannot convert NA to integer”.
df['distance(kmint)'] = result['Distance'].astype('int')
Any help would be appreciated!!
I filtered out the NaN's from the dataframe using this:
result = result[np.isfinite(result['distance(km)'])]
Then, I was able to convert from float to int.
An alternative approach would be to handle the NaN values as part of your data import and cleaning process. A more general solution could involve specifying which values count as NaN in the read_table call by setting the na_values parameter. What you want to make sure of is that there isn't some malformed data, like 1.5km, in one of your fields that is getting picked up as a NaN value.
pandas.read_table(..., na_values=None, keep_default_na=True, na_filter=True, ....)
Subsequently, once the dataframe is populated and the NaN values are identified properly, you can use the fillna method to substitute in zeros or the values that you identified as your distances.
Finally, it would probably be better to use notnull rather than isfinite to filter the rows before converting them to integers.
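The options above can be sketched on a small hypothetical frame. The first two follow the answers (filter with notnull, or fill with zeros); the third, the nullable Int64 dtype, is not mentioned in the thread and is shown only as a possible way to keep the missing row.

```python
import numpy as np
import pandas as pd

# Hypothetical data modeled on the question, with one missing distance.
result = pd.DataFrame({"Distance": [14.827379, 0.754254, np.nan, 1.833768]})

# Option 1: drop the rows with missing values, then convert (truncates toward zero).
kept = result[result["Distance"].notnull()].copy()
kept["distance_int"] = kept["Distance"].astype(int)

# Option 2: substitute a value for NaN first, then convert.
filled = result["Distance"].fillna(0).astype(int)

# Option 3 (assumption, not from the thread): truncate, then use the nullable
# integer dtype so the missing row stays missing instead of being dropped or filled.
nullable = np.trunc(result["Distance"]).astype("Int64")

print(kept)
print(filled)
print(nullable)
```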