Set half of tesor columns to zero - numpy

I have a tensor of size m x n (m rows and n columns).
For example:
[ 5 8 4 3
1 3 5 4
3 9 8 6 ]
I wish to randomly select half of the columns, and set all the values in this columns as zeros.
For our example, it will create something like this:
[ 5 0 4 0
1 0 5 0
3 0 8 0 ]
I'm aware how to set zero randomly half of all the elements,
torch.rand(x.shape) > 0.5
but done randomly without consideration in the columns, which is not helpfull for my case.
Thank you for any help,
Dave

import torch
x = torch.rand(3,4)
x
tensor([[0.0143, 0.1070, 0.9985, 0.0727],
[0.4052, 0.8716, 0.7376, 0.5495],
[0.2553, 0.2330, 0.9285, 0.6535]])
for i in [1,3] : # list has your columns which you want to make zero
x[:,i] = 0

Related

pandas: idxmax for k-th largest

Having df of probabilities distribution, I get max probability for rows with df.idxmax(axis=1) like this:
df['1k-th'] = df.idxmax(axis=1)
and get the following result:
(scroll the tables to the right if you can not see all the columns)
0 1 2 3 4 5 6 1k-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1
the question is how to get the 2-th, 3th, etc probabilities, so that I get the following result?:
0 1 2 3 4 5 6 1k-th 2-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6 0
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4 3
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1 4
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5 4
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1 2
Thank you!
My own solution is not the prettiest, but does it's job and works fast:
for i in range(7):
p[f'{i}k'] = p[[0,1,2,3,4,5,6]].idxmax(axis=1)
p[f'{i}k_v'] = p[[0,1,2,3,4,5,6]].max(axis=1)
for x in range(7):
p[x] = np.where(p[x]==p[f'{i}k_v'], np.nan, p[x])
The loop does:
finds the largest value and it's column index
drops the found value (sets to nan)
again
finds the 2nd largest value
drops the found value
etc ...

iterrows() of 2 columns and save results in one column

in my data frame I want to iterrows() of two columns but want to save result in 1 column.for example df is
x y
5 10
30 445
70 32
expected output is
points sequence
5 1
10 2
30 1
445 2
I know about iterrows() but it saved out put in two different columns.How can I get expected output and is there any way to generate sequence number according to condition? any help will be appreciated.
First never use iterrows, because really slow.
If want 1, 2 sequence by number of columns convert values to numy array by DataFrame.to_numpy and add numpy.ravel, then for sequence use numpy.tile:
df = pd.DataFrame({'points': df.to_numpy().ravel(),
'sequence': np.tile([1,2], len(df))})
print (df)
points sequence
0 5 1
1 10 2
2 30 1
3 445 2
4 70 1
5 32 2
Do this way:
>>> pd.DataFrame([i[1] for i in df.iterrows()])
points sequence
0 5 1
1 10 2
2 30 1
3 445 2

Replacing values in pandas data frame

I am looking for a pythonic way of replacing values based on whether values are big of small. Say I have a data frame:
ds = pandas.DataFrame({'x' : [4,3,2,1,5], 'y' : [4,5,6,7,8]})
I'd like to replace values on x which are lower than 2 by 2 and values higher than 4 by 4. And similarly with y values, replacing values lower than 5 by 5 and values higher than 7 by 7 so as to get this data frame:
ds = pandas.DataFrame({'x' : [4,3,2,2,4], 'y' : [5,5,6,7,7]})
I did it by iterating on the rows but is really ugly, any more pandas-pythonic way (Basically I want to eliminate extreme values)
You can check with clip
ds.x.clip(2,4)
Out[42]:
0 4
1 3
2 2
3 2
4 4
Name: x, dtype: int64
#ds.x=ds.x.clip(2,4)
#ds.y=ds.y.clip(5,7)
One way of doing this as follows:
>>> ds[ds.x.le(2) ] =2
>>> ds[ds.x.ge(4) ] =4
>>> ds
x y
0 4 4
1 3 5
2 2 6
3 2 2
4 4 4

Compute element overlap based on another column, pandas

If I have a dataframe of the form:
tag element_id
1 12
1 13
1 15
2 12
2 13
2 19
3 12
3 15
3 22
how can I compute the overlaps of the tags in terms of the element_id ? The result I guess should be an overlap matrix of the form:
1 2 3
1 X 2 2
2 2 X 1
3 2 1 X
where I put X on the diagonal since the overlap of a tag with itself is not relevant and where the numbers in the matrix represent the total element_ids that the two tags share.
My attempts:
You can try and use a for loop like :
for item in df.itertuples():
element_lst += [item.element_id]
element_tag = item.tag
# then intersect the element_list row by row.
# This is extremely costly for large datasets
The second thing I was thinking about was to use df.groupby('tag') and try to somehow intersect on element_id, but it is not clear to me how I can do that with grouped data.
merge + crosstab
# Find element overlap, remove same tag matches
res = df.merge(df, on='element_id').query('tag_x != tag_y')
pd.crosstab(res.tag_x, res.tag_y)
Output:
tag_y 1 2 3
tag_x
1 0 2 2
2 2 0 1
3 2 1 0

Assigning one column to another column between pandas DataFrames (like vector to vector assignment)

I have a super strange problem which I spent the last hour trying to solve, but with no success. It is even more strange since I can't replicate it on a small scale.
I have a large DataFrame (150,000 entries). I took out a subset of it and did some manipulation. the subset was saved as a different variable, x.
x is smaller than the df, but its index is in the same range as the df. I'm now trying to assign x back to the DataFrame replacing values in the same column:
rep_Callers['true_vpID'] = x.true_vpID
This inserts all the different values in x to the right place in df, but instead of keeping the df.true_vpID values that are not in x, it is filling them with NaNs. So I tried a different approach:
df.ix[x.index,'true_vpID'] = x.true_vpID
But instead of filling x values in the right place in df, the df.true_vpID gets filled with the first value of x and only it! I changed the first value of x several times to make sure this is indeed what is happening, and it is. I tried to replicate it on a small scale but it didn't work:
df = DataFrame({'a':ones(5),'b':range(5)})
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
z =Series([random() for i in range(5)],index = range(5))
0 0.812561
1 0.862109
2 0.031268
3 0.575634
4 0.760752
df.ix[z.index[[1,3]],'b'] = z[[1,3]]
a b
0 1 0.000000
1 1 0.812561
2 1 2.000000
3 1 0.575634
4 1 4.000000
5 1 5.000000
I really tried it all, need some new suggestions...
Try using df.update(updated_df_or_series)
Also using a simple example, you can modify a DataFrame by doing an index query and modifying the resulting object.
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
df_2 = df_1.ix[3:5]
df_2.b = df_2.b + 2
df_2
a b
3 1 5
4 1 6
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 5
4 1 6