Pandas: how to use convert_objects to replace strings with NaN values

This is related to a previous question I've asked, here: Replace any string in columns with 1
However, since that question has been answered long ago, I've started a new question here. I am essentially trying to use convert_objects to replace string values with 1's in the following dataframe (abbreviated here):
uniq_epoch T_Opp T_Eval
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
...
I am using the following code to do this. I've actually tried using this code on the entire dataframe, and have also applied it to a particular column. The result each time is that there is no error message, but also no change to the data (no values are converted to NaN, and the dtype is still 'O').
df = df.convert_objects(convert_numeric = True)
or
df.T_Eval = df.T_Eval.convert_objects(convert_numeric=True)
Desired final output is as follows:
uniq_epoch T_Opp T_Eval
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
...
There may also be an intermediate step in which the 1s are NaN, after which fillna(1) is used to insert 1s where the strings were.
I've already searched posts on stackoverflow, and looked at the documentation for convert_objects, but it is unfortunately pretty sparse. I wouldn't have known to even attempt to apply it this way if not for the previous post (linked above).
I'll also mention that there are quite a few strings (codes) in these columns, and that the codes can recombine, so doing this with a dict and replace() would take about as long as doing it by hand.
Based on the previous post and the various resources I've been able to find, I can't figure out why this isn't working - any help much appreciated, including pointing towards further documentation.

This works on 0.13.1 (docs here and here).
Maybe you have an older version; IIRC convert_objects was introduced in 0.11.
In [5]: df = read_csv(StringIO(data),sep='\s+',index_col=0)
In [6]: df
Out[6]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
[5 rows x 2 columns]
In [7]: df.convert_objects(convert_numeric=True)
Out[7]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 NaN
2 NaN 0
3 0 0
3 NaN 0
[5 rows x 2 columns]
In [8]: df.convert_objects(convert_numeric=True).dtypes
Out[8]:
T_Opp float64
T_Eval float64
dtype: object
In [9]: df.convert_objects(convert_numeric=True).fillna(1)
Out[9]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
[5 rows x 2 columns]
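Note that convert_objects was deprecated in later pandas releases and eventually removed; on a current version, a sketch along these lines should give the same end result using pd.to_numeric with errors='coerce':
import pandas as pd

# coerce every column: non-numeric strings become NaN, then fill with 1
df = df.apply(pd.to_numeric, errors='coerce').fillna(1)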

Related

Apply noise on non zero elements of data frame

I am a bit struggling with this one.
I have a dataframe, and I want to apply gaussian noise only on the non zero elements of the data frame. A silly way to do this is :
mu, sigma = 0, 0.1
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if df.iat[i, j] != 0:
            df.iat[i, j] += np.random.normal(mu, sigma)
The noise must be different for each element; we do not add the same value each time.
I would be happy if this simply worked, but for some reason it does not. Instead, I get this:
[images: before noise / after noise]
As you can see in the images, it works well for columns A and C, but not for the others. What is weird is that there is still a change (+/- 1, far from what one would expect of gaussian noise...).
I tried to see if this was some decimals problem with df.round(), but nothing came up.
So I am mostly looking for another way to apply my noise rather than for a fix to this weird problem. Thank you in advance.
I believe you can generate an array with the same shape as the original DataFrame and then add the values conditionally with where:
import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, size=(5,5)))
print (df)
0 1 2 3 4
0 0 4 1 1 3
1 3 0 3 3 2
2 0 2 4 1 3
3 4 0 3 0 2
4 3 1 3 3 1
mu, sigma = 0, 0.1
a = np.random.normal(mu,sigma, size=df.shape)
print (a)
[[ 0.10452115 -0.01051424 -0.13329652 -0.06376671 0.07245456]
[-0.21753186 0.05700441 0.03595196 -0.08154859 0.0076684 ]
[ 0.08368405 0.10390984 0.04692948 0.09711873 -0.06820933]
[-0.07229613 0.03954906 -0.06136678 -0.02328597 -0.22123564]
[-0.04316055 0.05945377 0.13736261 0.07895045 0.03714287]]
df = df.where(df == 0, df.add(a))
print (df)
0 1 2 3 4
0 0.000000 3.989486 0.866703 0.936233 3.072455
1 2.782468 0.000000 3.035952 2.918451 2.007668
2 0.000000 2.103910 4.046929 1.097119 2.931791
3 3.927704 0.000000 2.938633 0.000000 1.778764
4 2.956839 1.059454 3.137363 3.078950 1.037143
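For completeness, the same effect can be written with an explicit mask instead of where; a minimal sketch, assuming the original integer df and the noise array a from above:
# zero entries get zero noise added; non-zero entries get their own noise value
df_noisy = df + a * (df != 0)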

calculate the mean of one row according to its label

calculate the mean of the values in one row according to its label:
A = [1,2,3,4,5,6,7,8,9,10]
B = [0,0,0,0,0,1,1,1,1, 1]
Result = pd.DataFrame(data=[A, B])
I want the output to be: 0 -> 3; 1 -> 8.
pandas has the groupby function, but I don't know how to implement this. Thanks
This is a simple groupby problem ...
Result=Result.T
Result.groupby(Result[1])[0].mean()
Out[372]:
1
0 3
1 8
Name: 0, dtype: int64
Firstly, it sounds like you want to label the index:
In [11]: Result = pd.DataFrame(data=[A, B], index=['A', 'B'])
In [12]: Result
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9 10
B 0 0 0 0 0 1 1 1 1 1
If the index was unique you wouldn't have to do any groupby, just take the mean of each row (that's the axis=1):
In [13]: Result.mean(axis=1)
Out[13]:
A 5.5
B 0.5
dtype: float64
However, if you had multiple rows with the same label, then you'd need to groupby:
In [21]: Result2 = pd.DataFrame(data=[A, A, B], index=['A', 'A', 'B'])
In [22]: Result2.mean(axis=1)
Out[22]:
A 5.5
A 5.5
B 0.5
dtype: float64
Note the duplicate rows (which happen to have the same mean because I lazily reused the same row contents); in general we'd want to take the mean of those means:
In [23]: Result2.mean(axis=1).groupby(level=0).mean()
Out[23]:
A 5.5
B 0.5
dtype: float64
Note: .groupby(level=0) groups the rows which have the same index label.
You're making it difficult for yourself by constructing the dataframe so that the things you want to take the mean of and the things you want to use as labels end up as different rows (a column-oriented construction is sketched after the options below).
Option 1
groupby
This deals with the data presented in the dataframe Result
Result.loc[0].groupby(Result.loc[1]).mean()
1
0 3
1 8
Name: 0, dtype: int64
Option 2
Overkill using np.bincount, made simpler because your grouping values are 0 and 1 (I'd have a solution even if they weren't, but this keeps it simple). np.bincount(B, A) sums the values of A for each label in B and np.bincount(B) counts the occurrences per label, so their ratio is the per-label mean.
I wanted to use the raw lists A and B
pd.Series(np.bincount(B, A) / np.bincount(B))
0 3.0
1 8.0
dtype: float64
Option 3
Construct a series instead of a dataframe.
Again using raw lists A and B
pd.Series(A, B).mean(level=0)
0 3
1 8
dtype: int64
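Following up on that last point, here is a sketch of the column-oriented construction (the column names 'value' and 'label' are just illustrative choices), which turns this into a plain groupby:
import pandas as pd

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# values and labels as columns of the same frame
df = pd.DataFrame({'value': A, 'label': B})
df.groupby('label')['value'].mean()
# label
# 0    3.0
# 1    8.0
# Name: value, dtype: float64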

how to create a variable in pandas dataframe based on another variable

I have an issue where I have a dataframe data with multiple columns and I want to create a variable filter in the dataframe and assign the value 1 if activation_date is null else 0.
I have written this code, but it fails to produce the expected result: every row gets 0 irrespective of whether a date is present.
data['filter'] = [0 if x is not None else 1 for x in data['activation_dt']]
I think you need isnull to check for None or NaN, and then convert True to 1 and False to 0 with astype(int):
import numpy as np
import pandas as pd

data = pd.DataFrame({'activation_dt':[None, np.nan, 1]})
print (data)
activation_dt
0 NaN
1 NaN
2 1.0
data['filter'] = data['activation_dt'].isnull().astype(int)
print (data)
activation_dt filter
0 NaN 1
1 NaN 1
2 1.0 0
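As for why the original list comprehension puts 0 everywhere: once the column is read into pandas, missing dates are typically represented as NaN (or NaT), not None, and NaN is not None evaluates to True, so every element falls into the 0 branch. A quick check (a hypothetical session):
import numpy as np

x = np.nan
print(x is not None)   # True -> NaN is treated as "present" by the comprehension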

Why does Panel.groupby() not accept level parameter in pandas 0.14?

pn = Panel(randn(4,3,3), items=['a','a','b','b'], major_axis=np.arange(0,3), minor_axis=np.arange(0,3))
I'd like to average repeated items.
But...
pn.groupby(level=0)
TypeError: groupby() got an unexpected keyword argument 'level'
pn.groupby(axis='items')
TypeError: groupby() takes at least 2 arguments (2 given)
which I don't fully understand...
Is there another way out?
You need to pass a mapping function (which in this case is an identity function).
As of 0.14.1 and prior, the API for Panel.groupby is different from Series/DataFrame.groupby (as it has not been updated). 0.15.0 will address this.
In [6]: pn = Panel(randn(4,3,3), items=['a','a','b','b'], major_axis=np.arange(0,3), minor_axis=np.arange(0,3))
In [7]: pn
Out[7]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: a to b
Major_axis axis: 0 to 2
Minor_axis axis: 0 to 2
Showing it converted to a frame just for clarity:
In [10]: pn.to_frame()
Out[10]:
a a b b
major minor
0 0 0.149552 0.149552 -3.750202 -3.750202
1 -1.354459 -1.354459 0.744473 0.744473
2 -0.183734 -0.183734 0.081965 0.081965
1 0 -1.946947 -1.946947 0.039065 0.039065
1 -0.648491 -0.648491 -0.141705 -0.141705
2 -1.581965 -1.581965 -0.628115 -0.628115
2 0 -1.280040 -1.280040 -0.556467 -0.556467
1 -0.093943 -0.093943 0.722981 0.722981
2 -0.207690 -0.207690 0.914684 0.914684
In [11]: pn.groupby(lambda x: x, axis='items').mean().to_frame()
Out[11]:
a b
major minor
0 0 0.733896 -1.814611
1 -1.021487 0.182690
2 -0.791080 -0.040136
1 0 -1.141415 -0.445435
1 -0.678486 -0.395380
2 -1.504996 0.172791
2 0 -0.405256 -0.999300
1 0.001912 0.272143
2 -0.987270 0.154344
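As a side note, Panel itself was deprecated in pandas 0.20 and removed in 0.25, so on a modern version the data would live in a MultiIndexed DataFrame such as the to_frame() output above. A rough sketch of averaging the repeated items once the data is in that frame form (using the pn.to_frame() result shown above):
# group the transposed frame by its 'a'/'b' column labels, average, and transpose back
df = pn.to_frame()
averaged = df.T.groupby(level=0).mean().T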

Pandas, multiindex, dates, HDFstore and frame_tables

I want to use a MultiIndex with dates as one of the hierarchical index types. I also want to save the DataFrame as a frame_table, so that I can select subsets from disk without loading the whole thing. I currently get an error: TypeError: [date] is not implemented as a table column and I was wondering if I am using the multiindex incorrectly, or this is indeed a limitation of Pandas. Thanks!
import pandas as pd, numpy, datetime
print pd.__version__ #-> 0.13.0rc1
idx1 = pd.MultiIndex.from_tuples([(datetime.date(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)])
df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
with pd.get_store('test1.h5') as f:
    f.put('trials', df1)  #-> OK
with pd.get_store('test2.h5') as f:
    f.put('trials', df1, data_columns=True, format='t')  #-> TypeError: [date] is not implemented as a table column
Use datetime.datetime, as these types can be stored efficiently. Docs are here for an example of storing a multi-index frame in an HDFStore.
When storing a multi-index, you MUST specify names for the levels (HDFStore currently won't warn you if you try to store one without names; this will be addressed in the next release).
In [20]: idx1 = pd.MultiIndex.from_tuples([(datetime.datetime(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)],names=['date','s','t'])
In [21]: df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
You need to store as a table (put stores in Fixed format, unless append is specified).
In [22]: df1.to_hdf('test.h5','df',mode='w',format='table')
In [23]: pd.read_hdf('test.h5','df')
Out[23]:
a b
date s t
2013-12-01 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
2013-12-02 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
[12 rows x 2 columns]
Sample selection
In [8]: pd.read_hdf('test.h5','df',where='date=20131202')
Out[8]:
a b
date s t
2013-12-02 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
[6 rows x 2 columns]
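Since the frame is stored in table format, the where selection can also be combined with a columns argument to bring back only part of the frame; a minimal sketch against the file created above:
# read only column 'a' for the selected date
pd.read_hdf('test.h5', 'df', where='date=20131202', columns=['a'])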