Apply noise to non-zero elements of a data frame - pandas

I am struggling a bit with this one.
I have a dataframe, and I want to apply Gaussian noise only to the non-zero elements of the data frame. A silly way to do this is:
mu, sigma = 0, 0.1
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if df.iat[i, j] != 0:
            df.iat[i, j] += np.random.normal(mu, sigma)
The noise must be different for each element; we do not add the same value each time.
And I would be happy if this worked, but for some reason it does not. Instead, I got this:
(image: before noise)
(image: after noise)
As you can see in the images, it works well for columns A and C, but not for the others. What is weird is that there is still a change (+/- 1, so far from what one would expect of Gaussian noise...).
I tried to see if this was some decimals problem with df.round(), but nothing came up.
So I am mostly looking for another way to apply my noise, rather than for a fix to this weird problem. Thank you in advance.

I believe you can generate an array with the same size as the original DataFrame and then add the values conditionally with where:
np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, size=(5,5)))
print (df)
   0  1  2  3  4
0  0  4  1  1  3
1  3  0  3  3  2
2  0  2  4  1  3
3  4  0  3  0  2
4  3  1  3  3  1
mu, sigma = 0, 0.1
a = np.random.normal(mu,sigma, size=df.shape)
print (a)
[[ 0.10452115 -0.01051424 -0.13329652 -0.06376671 0.07245456]
[-0.21753186 0.05700441 0.03595196 -0.08154859 0.0076684 ]
[ 0.08368405 0.10390984 0.04692948 0.09711873 -0.06820933]
[-0.07229613 0.03954906 -0.06136678 -0.02328597 -0.22123564]
[-0.04316055 0.05945377 0.13736261 0.07895045 0.03714287]]
df = df.where(df == 0, df.add(a))
print (df)
          0         1         2         3         4
0  0.000000  3.989486  0.866703  0.936233  3.072455
1  2.782468  0.000000  3.035952  2.918451  2.007668
2  0.000000  2.103910  4.046929  1.097119  2.931791
3  3.927704  0.000000  2.938633  0.000000  1.778764
4  2.956839  1.059454  3.137363  3.078950  1.037143
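If you prefer building the mask yourself, the same result can be written as a single masked addition (a sketch equivalent to the where version above, assuming df and a as defined there):
df = df + (df != 0) * a
Multiplying the noise by the boolean mask zeroes it out wherever the original value is 0, so only the non-zero cells are perturbed.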

Related

Set half of tensor columns to zero

I have a tensor of size m x n (m rows and n columns).
For example:
[ 5 8 4 3
1 3 5 4
3 9 8 6 ]
I wish to randomly select half of the columns and set all the values in these columns to zero.
For our example, it will create something like this:
[ 5 0 4 0
1 0 5 0
3 0 8 0 ]
I'm aware of how to randomly set half of all the elements to zero,
torch.rand(x.shape) > 0.5
but this is done randomly without consideration of the columns, which is not helpful for my case.
Thank you for any help,
Dave
import torch
x = torch.rand(3,4)
x
tensor([[0.0143, 0.1070, 0.9985, 0.0727],
        [0.4052, 0.8716, 0.7376, 0.5495],
        [0.2553, 0.2330, 0.9285, 0.6535]])
for i in [1, 3]:  # list of the columns you want to set to zero
    x[:, i] = 0
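To actually pick a random half of the columns instead of hard-coding [1, 3], one possible sketch using torch.randperm (the tensor shape here is just for illustration):
import torch

x = torch.rand(3, 4)
n_cols = x.shape[1]
cols = torch.randperm(n_cols)[: n_cols // 2]  # random half of the column indices
x[:, cols] = 0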

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values - more than 5 std.
I want to replace, per column, each value that is more than 5 std with the maximum of the other values.
For example,
df =    A    B
        1    2
        1    6
        2    8
        1  115
      191    1
Will become:
df =    A    B
        1    2
        1    6
        2    8
        1    8
        2    1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with max per column and re-sort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do:
q = 100
df.loc[df['A'] > q, 'A'] = max(df.loc[df['A'] < q, 'A'])
df
This fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B.
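If you would rather not repeat this for each column by hand, the same idea can be applied across all columns at once (a sketch, assuming the same threshold q is appropriate for every column):
q = 100
df = df.apply(lambda col: col.mask(col > q, col[col < q].max()))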
Calculate a column-wise z-score (if you deem something an outlier when it lies more than a given number of standard deviations from the column mean), and then compute a boolean mask of values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
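For example, to replicate the "replace with the max of the other values" behaviour from the question, one possible sketch (using the outlier_mask computed above):
df = df.mask(outlier_mask)   # outliers become NaN
df = df.fillna(df.max())     # fill each NaN with its column's max over the remaining values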

Groupby and multiindexes - how to organize data with irregular sizes?

I am trying to organize 3D data collected from several participants, with a different number of samples for each participant. Each participant has a unique session and seat index in the experiment. For each participant i, I have a 3D array composed of Ni images (height*width).
I first tried creating a Dataset of participants, but I ended up with many NaNs because participants have different numbers of samples along the same dimension (the sample dim). I then switched to a single DataArray containing all my participants' data concatenated along one dimension I call depth. This dimension is then associated with a multiindex coordinate combining the session, seat and sample coordinates:
<xarray.DataArray (depth: 52, height: 4, width: 4)>
array([[[0.92337111, 0.86505447, 0.08541727, 0.74850848],
[0.02336959, 0.0495726 , 0.98745956, 0.58831929],
[0.62128185, 0.7732787 , 0.27716268, 0.83634779],
[0.08146719, 0.35851012, 0.44170263, 0.74338872]],
...
[[0.4365896 , 0.23527988, 0.86891853, 0.94486637],
[0.20884748, 0.81012315, 0.61542411, 0.76706922],
[0.33391262, 0.88955315, 0.25329999, 0.35803887],
[0.49586615, 0.94767265, 0.40868892, 0.42393425]]])
Coordinates:
* height (height) int64 0 1 2 3
* width (width) int64 0 1 2 3
* depth (depth) MultiIndex
- session (depth) int64 0 0 0 0 0 0 0 0 0 0 0 1 1 ... 3 3 3 3 3 3 3 3 3 3 3 3
- seat (depth) int64 0 0 0 0 0 1 1 1 1 1 1 0 0 ... 0 0 0 0 0 1 1 1 1 1 1 1
- sample (depth) int64 0 1 2 3 4 0 1 2 3 4 5 0 1 ... 1 2 3 4 5 0 1 2 3 4 5 6
However, I find this solution not really usable, for several reasons:
Each time I want to perform a groupby, I have to reset the index and recreate one with the coordinates I want to group by, since xarray does not support grouping by multiple coordinates on the same dim:
da = da.reset_index('depth')
da = da.set_index(depth=['session', 'seat'])
da.groupby('depth').mean()
The result of the code above is not perfect, as it does not keep the multiindex level names:
<xarray.DataArray (depth: 8, height: 4, width: 4)>
array([[[0.47795382, 0.67322777, 0.12946181, 0.48983815],
[0.33895882, 0.46772217, 0.62886196, 0.55970122],
[0.57370573, 0.47272117, 0.31529004, 0.63230245],
[0.63230284, 0.5352105 , 0.65805407, 0.65274841]],
...
[[0.55672404, 0.37963945, 0.57334768, 0.64853806],
[0.46608072, 0.39506509, 0.66339553, 0.71447367],
[0.58989461, 0.66066485, 0.53271228, 0.43036214],
[0.44163921, 0.54990042, 0.4229631 , 0.5941268 ]]])
Coordinates:
* height (height) int64 0 1 2 3
* width (width) int64 0 1 2 3
* depth (depth) MultiIndex
- depth_level_0 (depth) int64 0 0 1 1 2 2 3 3
- depth_level_1 (depth) int64 0 1 0 1 0 1 0 1
I can use sel only on fully indexed data (i.e. by using session, seat and sample in the depth index), so I end up re-indexing my data again and again.
I find using hvplot on such a DataArray not really straightforward (I am skipping the details here for easier reading of this already long post).
Is there something I am missing? Is there a better way to organize my data? I tried to create multiple indexes on the same dim for convenience, but without success.

Replacing values in pandas data frame

I am looking for a pythonic way of replacing values based on whether they are big or small. Say I have a data frame:
ds = pandas.DataFrame({'x' : [4,3,2,1,5], 'y' : [4,5,6,7,8]})
I'd like to replace the values of x lower than 2 by 2 and the values higher than 4 by 4, and similarly with the y values, replacing values lower than 5 by 5 and values higher than 7 by 7, so as to get this data frame:
ds = pandas.DataFrame({'x' : [4,3,2,2,4], 'y' : [5,5,6,7,7]})
I did it by iterating over the rows, but it is really ugly. Is there a more pandas-pythonic way? (Basically, I want to eliminate extreme values.)
You can do this with clip:
ds.x.clip(2,4)
Out[42]:
0 4
1 3
2 2
3 2
4 4
Name: x, dtype: int64
#ds.x=ds.x.clip(2,4)
#ds.y=ds.y.clip(5,7)
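To clip both columns in one call, clip also accepts per-column bounds as Series aligned on the columns (a sketch; the bounds below are the ones from the question):
lower = pandas.Series({'x': 2, 'y': 5})
upper = pandas.Series({'x': 4, 'y': 7})
ds = ds.clip(lower=lower, upper=upper, axis=1)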
One way of doing this is as follows:
>>> ds.loc[ds.x.lt(2), 'x'] = 2
>>> ds.loc[ds.x.gt(4), 'x'] = 4
>>> ds
   x  y
0  4  4
1  3  5
2  2  6
3  2  7
4  4  8

Assigning one column to another column between pandas DataFrames (like vector to vector assignment)

I have a super strange problem which I spent the last hour trying to solve, but with no success. It is even stranger since I can't replicate it on a small scale.
I have a large DataFrame (150,000 entries). I took out a subset of it and did some manipulation. The subset was saved as a different variable, x.
x is smaller than the df, but its index is in the same range as the df's. I'm now trying to assign x back to the DataFrame, replacing values in the same column:
rep_Callers['true_vpID'] = x.true_vpID
This inserts all the different values of x into the right places in df, but instead of keeping the df.true_vpID values that are not in x, it fills them with NaNs. So I tried a different approach:
df.ix[x.index,'true_vpID'] = x.true_vpID
But instead of the x values being filled into the right places in df, df.true_vpID gets filled with the first value of x, and only it! I changed the first value of x several times to make sure this is indeed what is happening, and it is. I tried to replicate it on a small scale, but it didn't work:
df = DataFrame({'a':ones(5),'b':range(5)})
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
z = Series([random() for i in range(5)], index=range(5))
0 0.812561
1 0.862109
2 0.031268
3 0.575634
4 0.760752
df.ix[z.index[[1,3]],'b'] = z[[1,3]]
a b
0 1 0.000000
1 1 0.812561
2 1 2.000000
3 1 0.575634
4 1 4.000000
5 1 5.000000
I really tried it all, need some new suggestions...
Try using df.update(updated_df_or_series)
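A minimal sketch of what that can look like (the column name and values here are made up; update aligns on the index and only overwrites the rows that are present in the other object):
import pandas as pd

df = pd.DataFrame({'true_vpID': [0.0, 0.0, 0.0, 0.0, 0.0]})
x = pd.Series([10.0, 30.0], index=[1, 3], name='true_vpID')

df.update(x)   # rows 1 and 3 get x's values, the rest keep their original values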
Also, as a simple example, you can modify a DataFrame by doing an index query and modifying the resulting object:
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
df_2 = df_1.ix[3:5]
df_2.b = df_2.b + 2
df_2
a b
3 1 5
4 1 6
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 5
4 1 6