How to drop rows in pandas where all values are below a threshold

I have data that looks like this:
protein IHD CM ARR VD CHD CCD VOO
0 q9uku9 0.000000 0.039457 0.032901 0.014793 0.006614 0.006591 0.000000
1 o75461 0.000000 0.005832 0.027698 0.000000 0.000000 0.006634 0.000000
There are thousands of rows of proteins. I want to drop the rows where every disease value is less than 0.01. How do I do this?

You can use loc in combination with any: keep all rows where any value is greater than or equal to 0.01. Note that I adjusted your example so that the second protein has all values < 0.01.
import pandas as pd
df = pd.DataFrame([
['q9uku9', 0.000000, 0.039457, 0.032901, 0.014793, 0.006614, 0.006591, 0.000000 ],
['o75461', 0.000000, 0.005832, 0.007698, 0.000000, 0.000000, 0.006634, 0.000000]
], columns=['protein', 'IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO'])
df = df.set_index('protein')
df_filtered = df.loc[(df >= 0.01).any(axis=1)]
Which gives:
IHD CM ARR VD CHD CCD VOO
protein
q9uku9 0.0 0.039457 0.032901 0.014793 0.006614 0.006591 0.0

>>> df
protein IHD CM ARR VD CHD CCD VOO
0 q9uku9 0.0 0.039457 0.032901 0.014793 0.006614 0.006591 0.0
1 o75461 0.0 0.005832 0.027698 0.000000 0.000000 0.006634 0.0
2 d4acr8 0.0 0.001490 0.003920 0.000000 0.000000 0.009393 0.0
>>> df.loc[~(df.select_dtypes(float) < 0.01).all(axis="columns")]
protein IHD CM ARR VD CHD CCD VOO
0 q9uku9 0.0 0.039457 0.032901 0.014793 0.006614 0.006591 0.0
1 o75461 0.0 0.005832 0.027698 0.000000 0.000000 0.006634 0.0
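The two answers agree by De Morgan's law: keeping rows where any value is >= 0.01 selects exactly the same rows as dropping rows where all values are < 0.01. A minimal sketch with the data from the question (second row adjusted so it gets dropped):

```python
import pandas as pd

df = pd.DataFrame([
    ['q9uku9', 0.000000, 0.039457, 0.032901, 0.014793, 0.006614, 0.006591, 0.000000],
    ['o75461', 0.000000, 0.005832, 0.007698, 0.000000, 0.000000, 0.006634, 0.000000],
], columns=['protein', 'IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']).set_index('protein')

# "keep rows where any >= 0.01" and "drop rows where all < 0.01" are equivalent
keep_any = df.loc[(df >= 0.01).any(axis=1)]
drop_all = df.loc[~(df < 0.01).all(axis=1)]
```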

Julia - using CartesianIndices with an array

I am trying to access specific elements of an NxN matrix 'msk', with indices stored in a Mx2 array 'idx'. I tried the following:
N = 10
msk = zeros(N,N)
idx = [1 5;6 2;3 7;8 4]
#CIs = CartesianIndices(( 2:3, 5:6 )) # this works, but not what I want
CIs = CartesianIndices((idx[:,1],idx[:,2]))
msk[CIs] .= 1
I get the following: ERROR: LoadError: MethodError: no method matching CartesianIndices(::Tuple{Array{Int64,1},Array{Int64,1}})
Is this what you want? (I am using your definitions)
julia> msk[CartesianIndex.(eachcol(idx)...)] .= 1;
julia> msk
10×10 Array{Float64,2}:
0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Note that I use a vector of CartesianIndex:
julia> CartesianIndex.(eachcol(idx)...)
4-element Array{CartesianIndex{2},1}:
CartesianIndex(1, 5)
CartesianIndex(6, 2)
CartesianIndex(3, 7)
CartesianIndex(8, 4)
because the docstring for CartesianIndices says:
Define a region R spanning a multidimensional rectangular range of integer indices.
so the region it defines must be rectangular.
Another way to get the required indices would be e.g.:
julia> CartesianIndex.(Tuple.(eachrow(idx)))
4-element Array{CartesianIndex{2},1}:
CartesianIndex(1, 5)
CartesianIndex(6, 2)
CartesianIndex(3, 7)
CartesianIndex(8, 4)
or (this time we use linear indexing into msk as it is just a Matrix)
julia> [x + (y-1)*size(msk, 1) for (x, y) in eachrow(idx)]
4-element Array{Int64,1}:
41
16
63
38
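For readers more at home in NumPy, the same distinction exists there: passing a tuple of integer index vectors already does the point-wise indexing that the Julia answer builds with a vector of CartesianIndex, and np.ravel_multi_index plays the role of the manual linear-index formula. A rough Python sketch of the same example (0-based, so every index is one less than in the Julia session):

```python
import numpy as np

N = 10
msk = np.zeros((N, N))
# the same four points as idx in the question, shifted to 0-based indexing
rows = np.array([1, 6, 3, 8]) - 1
cols = np.array([5, 2, 7, 4]) - 1

# a tuple of integer vectors indexes point-wise, like CartesianIndex.(...) in Julia
msk[rows, cols] = 1

# linear indices, analogous to x + (y - 1) * size(msk, 1); order='F' matches
# Julia's column-major layout (each value is one less than Julia's 41, 16, 63, 38)
lin = np.ravel_multi_index((rows, cols), msk.shape, order='F')
```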

pandas assign result from list of columns

Suppose I have a dataframe as shown below:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'A':np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
df
>>>
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 0.0
And I have the list of columns which I want to populate with the value of 1, when A is negative.
idx = df.A < 0
cols = ['B', 'C']
So in this case, I want the indices [1, 'B'] and [4, 'C'] set to 1.
What I tried:
However, doing df.loc[idx, cols] = 1 sets the entire row to be 1, and not just the individual column. I also tried doing df.loc[idx, cols] = pd.get_dummies(cols) which gave the result:
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 1.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 NaN NaN
I'm assuming this is because the index of get_dummies and the dataframe don't line up.
Expected Output:
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
So what's the best (read: fastest) way to do this? In my case, there are thousands of rows and 5 columns.
Timing of results:
TLDR: editing values directly is faster.
%%timeit
df.values[idx, df.columns.get_indexer(cols)] = 1
123 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.iloc[idx.array,df.columns.get_indexer(cols)]=1
266 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use NumPy indexing to improve performance:
idx = df.A < 0
res = ['B', 'C']
arr = df.values
arr[idx, df.columns.get_indexer(res)] = 1
print (arr)
[[ 0.49671415 0. 0. ]
[-0.1382643 1. 0. ]
[ 0.64768854 0. 0. ]
[ 1.52302986 0. 0. ]
[-0.23415337 0. 1. ]]
df = pd.DataFrame(arr, columns=df.columns, index=df.index)
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
Alternative:
idx = df.A < 0
res = ['B', 'C']
df.values[idx, df.columns.get_indexer(res)] = 1
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
ind = df.index[idx]
for i, col in zip(ind, res):
    df.at[i, col] = 1
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
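One caveat worth flagging on the df.values route: it only sticks when the frame is a single homogeneous block (here, all float64), because .values is then a view on the underlying data. With mixed dtypes, .values materializes a temporary copy and the write is silently lost. A small sketch of the pitfall, plus a dtype-safe per-cell fallback with .iat (the name column is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.50, -0.14, 0.65, 1.52, -0.23],
                   'name': list('abcde'),      # object column -> mixed dtypes
                   'B': np.zeros(5), 'C': np.zeros(5)})
idx = (df.A < 0).to_numpy()
cols = ['B', 'C']

# .values on a mixed-dtype frame is a fresh object array, so this write is lost
df.values[idx, df.columns.get_indexer(cols)] = 1
write_lost = (df[['B', 'C']].to_numpy() == 0).all()

# dtype-safe fallback: set each target cell positionally
for r, c in zip(np.flatnonzero(idx), df.columns.get_indexer(cols)):
    df.iat[r, c] = 1
```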

Finding the distances from each point to the rest, looping

I am new to python.
I have a csv file containing 400 pairs of x and y in two columns.
I want to loop over the data so that it starts from a pair (x_i, y_i) and finds the distance between that pair and the rest of the 399 points. I want the process to be repeated for all pairs of (x_i, y_i), with each result appended to a list Dist_i.
import numpy as np
import pandas as pd

x_y_data = pd.read_csv("x_y_points400_labeled_csv.csv")
x = x_y_data.loc[:, 'x']
y = x_y_data.loc[:, 'y']
i = 0
j = 0
while i < len(x):
    Dist = np.sqrt((x[i] - x)**2 + (y[j] - y)**2)
    i = 1 + i
    j = 1 + j
print(Dist)
output:
0 676.144955
1 675.503342
2 674.642602
..
396 9.897127
397 21.659654
398 15.508062
399 0.000000
Length: 400, dtype: float64
This is how far I went, but it is not what I intend to obtain. My goal is to get something like in the picture attached.
Thanks for your help in advance
You can use broadcasting (arr[:, None]) to do this calculation all at once. This will give you the repetitive calculations you want. Otherwise scipy.spatial.distance.pdist gives you the upper triangle of the calculations.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
N = 6
df = pd.DataFrame(np.random.normal(0, 1, (N, 2)),
                  columns=['X', 'Y'],
                  index=[f'point{i}' for i in range(N)])
x = df['X'].to_numpy()
y = df['Y'].to_numpy()
result = pd.DataFrame(np.sqrt((x[:, None] - x)**2 + (y[:, None] - y)**2),
                      index=df.index,
                      columns=df.index)
point0 point1 point2 point3 point4 point5
point0 0.000000 2.853297 0.827596 1.957709 3.000780 1.165343
point1 2.853297 0.000000 3.273161 2.915990 1.172704 1.708145
point2 0.827596 3.273161 0.000000 2.782669 3.121463 1.749023
point3 1.957709 2.915990 2.782669 0.000000 3.718481 1.779459
point4 3.000780 1.172704 3.121463 3.718481 0.000000 2.092455
point5 1.165343 1.708145 1.749023 1.779459 2.092455 0.000000
With scipy.
from scipy.spatial.distance import pdist
pdist(df[['X', 'Y']])
array([2.8532972 , 0.82759587, 1.95770875, 3.00078036, 1.16534282,
3.27316125, 2.91598992, 1.17270443, 1.70814458, 2.78266933,
3.1214628 , 1.74902298, 3.7184812 , 1.77945856, 2.09245472])
To turn this into the above DataFrame.
L = len(df)
arr = np.zeros((L, L))
arr[np.triu_indices(L, 1)] = pdist(df[['X', 'Y']])
arr = arr + arr.T # Lower triangle b/c symmetric
pd.DataFrame(arr, index=df.index, columns=df.index)
point0 point1 point2 point3 point4 point5
point0 0.000000 2.853297 0.827596 1.957709 3.000780 1.165343
point1 2.853297 0.000000 3.273161 2.915990 1.172704 1.708145
point2 0.827596 3.273161 0.000000 2.782669 3.121463 1.749023
point3 1.957709 2.915990 2.782669 0.000000 3.718481 1.779459
point4 3.000780 1.172704 3.121463 3.718481 0.000000 2.092455
point5 1.165343 1.708145 1.749023 1.779459 2.092455 0.000000
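scipy also provides squareform, which expands the condensed pdist vector into the full symmetric matrix directly, so the triu_indices round-trip above can be skipped (same sample data and seed as before):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

np.random.seed(123)
N = 6
df = pd.DataFrame(np.random.normal(0, 1, (N, 2)), columns=['X', 'Y'],
                  index=[f'point{i}' for i in range(N)])

# squareform fills both triangles and puts zeros on the diagonal
dist = pd.DataFrame(squareform(pdist(df[['X', 'Y']])),
                    index=df.index, columns=df.index)
```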

Unable to set cell values in Python Pandas SparseDataFrame

I'm having a hard time getting cell values to stick in a SparseDataFrame when updating by index/column. I've tried setting cell values using df.at, df.ix, and df.loc, and the DataFrame remains unchanged.
df = pd.SparseDataFrame(np.zeros((10,10)), default_fill_value=0)
df.at[1,1] = 1
df.ix[2,2] = 1
df.loc[3,3] = 1
df
0 1 2 3 4 5 6 7 8 9
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Any one of these options works fine on a standard DataFrame.
The one option I've found that works is
df = df.set_value(1, 1, 1)
But this is terribly slow for a large matrix.
[edit] I did see a four-year-old answer that suggested the approach below, but it said that more direct methods were in the works. I also haven't tested its speed, but it seems like converting whole slices of the matrix to dense and back would be much slower than directly updating a (row, col, val) entry, as you can with scipy sparse matrices.
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))
s = df[2000].to_dense()
s[1000] = 1
df[2000] = s
In [11]: df.ix[1000,2000]
Out[11]: 1.0
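For what it's worth, SparseDataFrame never gained fast cell assignment and was removed outright in pandas 1.0. If the goal is incrementally filling a large sparse matrix, one common workaround is scipy's lil_matrix, whose list-of-lists layout is designed for exactly this write pattern (a sketch, not a pandas API):

```python
from scipy.sparse import lil_matrix

# lil format supports cheap single-cell writes while building the matrix
m = lil_matrix((10, 10))
m[1, 1] = 1
m[2, 2] = 1
m[3, 3] = 1

# convert to CSR once construction is done, for fast arithmetic and slicing
csr = m.tocsr()
```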

How do I aggregate sub-dataframes in pandas?

Suppose I have a two-level multi-indexed dataframe:
In [1]: index = pd.MultiIndex.from_tuples([(i, j) for i in range(3)
   ...:                                    for j in range(1 + i)], names=list('ij'))
   ...: df = pd.DataFrame(0.1 * np.arange(2 * len(index)).reshape(-1, 2),
   ...:                   columns=list('xy'), index=index)
   ...: df
Out[1]:
x y
i j
0 0 0.0 0.1
1 0 0.2 0.3
1 0.4 0.5
2 0 0.6 0.7
1 0.8 0.9
2 1.0 1.1
And I want to run a custom function on every sub-dataframe:
In [2]: def my_aggr_func(subdf):
   ...:     return subdf['x'].mean() / subdf['y'].mean()
   ...:
   ...: level0 = df.index.levels[0].values
   ...: pd.DataFrame({'mean_ratio': [my_aggr_func(df.loc[i]) for i in level0]},
   ...:              index=pd.Index(level0, name=index.names[0]))
Out[2]:
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
Is there an elegant way to do it with df.groupby('i').agg(__something__) or something similar?
Use GroupBy.apply, which operates on each sub-DataFrame:
df1 = df.groupby('i').apply(my_aggr_func).to_frame('mean_ratio')
print (df1)
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
You don't need the custom function. You can calculate the 'within group means' with agg then perform an eval to get the ratio you want.
df.groupby('i').agg('mean').eval('x / y')
i
0 0.000000
1 0.750000
2 0.888889
dtype: float64
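Both answers in one self-contained script, using the same data as the question:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([(i, j) for i in range(3) for j in range(1 + i)],
                                  names=list('ij'))
df = pd.DataFrame(0.1 * np.arange(2 * len(index)).reshape(-1, 2),
                  columns=list('xy'), index=index)

# GroupBy.apply hands each sub-DataFrame to the function
ratio_apply = df.groupby('i').apply(lambda sub: sub['x'].mean() / sub['y'].mean())

# agg + eval does the same without a custom function
ratio_eval = df.groupby('i').agg('mean').eval('x / y')
```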