How to get the mode of a column in pandas when several values are tied for the mode

I have a data frame and I'd like to get the mode of a specific column.
I'm using:
freq_mode = df.mode()['my_col'][0]
However, I get the error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index my_col')
I'm guessing it's because I have several values tied for the mode.
Any of the tied modes will do; it doesn't matter which. How can I get any one of the existing modes?

For me your code works fine with sample data.
If you need to select the first value of the Series returned by mode, use:
freq_mode = df['my_col'].mode().iat[0]
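A minimal sketch of why .iat[0] helps when values are tied, using a made-up column (the values below are an assumption, not the asker's data):
import pandas as pd

# Hypothetical column where 1 and 2 are tied for the mode.
df = pd.DataFrame({'my_col': [1, 1, 2, 2, 3]})

print(df['my_col'].mode())              # a Series holding both tied modes: 1 and 2
freq_mode = df['my_col'].mode().iat[0]  # take the first tied mode (here 1)
print(freq_mode)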

Here is an example with one column:
df = pd.DataFrame({"A": [14, 4, 5, 4, 1, 5],
                   "B": [5, 2, 54, 3, 2, 7],
                   "C": [20, 20, 7, 3, 8, 7],
                   "train_label": [7, 7, 6, 6, 6, 7]})
X = df['train_label'].mode()
print(X)
DataFrame:
    A   B   C  train_label
0  14   5  20            7
1   4   2  20            7
2   5  54   7            6
3   4   3   3            6
4   1   2   8            6
5   5   7   7            7
Output:
0    6
1    7
dtype: int64

pandas read csv is returning extra unknown column

I am creating a csv file from a pandas dataframe by combining two lists:
df = pd.DataFrame(list(zip(patients_full, labels)),
                  columns=['id', 'cancer'])
df.to_csv("labels.csv")
But when I read the csv back, an unknown "Unnamed" column shows up. How do I remove it?
    Unnamed: 0          id  cancer
0            0  HF0953.npy       1
1            1  HF1058.npy       3
2            2  HF1071.npy       3
3            3  HF1122.npy       3
4            4  HF1235.npy       1
5            5  HF1280.npy       2
6            6  HF1344.npy       1
7            7  HF1463.npy       1
8            8  HF1489.npy       1
9            9  HF1490.npy       2
10          10  HF1587.npy       2
11          11  HF1613.npy       2
This is happening because of the index column that to_csv("labels.csv") saves by default. Since the index column in the data frame you saved didn't have a name, read_csv("labels.csv") treats it like any other column, but with a blank name that becomes Unnamed: 0. To avoid this you have two options:
Option 1 - read the saved column back in as the index:
pd.read_csv("labels.csv", index_col=0)
Option 2 - don't save the index in the first place:
df.to_csv("labels.csv", index=False)
That column in your output is the index of the dataframe. To not include it in the output, use df.to_csv('labels.csv', index=False). More information on to_csv is available in the pandas docs.
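A quick round-trip sketch of both options (the file name and values are just illustrative):
import pandas as pd

df = pd.DataFrame({'id': ['HF0953.npy', 'HF1058.npy'], 'cancer': [1, 3]})

# Option 2: don't write the index at all.
df.to_csv("labels.csv", index=False)
print(pd.read_csv("labels.csv"))               # no Unnamed: 0 column

# Option 1: write the index, then read it back in as the index.
df.to_csv("labels.csv")
print(pd.read_csv("labels.csv", index_col=0))  # Unnamed: 0 absorbed as the index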

pandas: idxmax for k-th largest

Given a df of probability distributions, I get the most probable column per row with df.idxmax(axis=1), like this:
df['1k-th'] = df.idxmax(axis=1)
and get the following result:
(scroll the tables to the right if you cannot see all the columns)
0 1 2 3 4 5 6 1k-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1
The question is how to get the 2nd, 3rd, etc. most probable columns, so that I get the following result:
0 1 2 3 4 5 6 1k-th 2-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6 0
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4 3
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1 4
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5 4
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1 2
Thank you!
My own solution is not the prettiest, but it does its job and works fast:
for i in range(7):
    p[f'{i}k'] = p[[0, 1, 2, 3, 4, 5, 6]].idxmax(axis=1)
    p[f'{i}k_v'] = p[[0, 1, 2, 3, 4, 5, 6]].max(axis=1)
    for x in range(7):
        p[x] = np.where(p[x] == p[f'{i}k_v'], np.nan, p[x])
On each pass the loop:
finds the largest remaining value and its column index
drops the found value (sets it to NaN)
so the next pass finds the 2nd largest value, drops it, and so on.
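For what it's worth, a vectorized sketch of the same idea using numpy.argsort (the '1-th', '2-th' column names just mirror the question's naming and are not part of any API):
import numpy as np
import pandas as pd

p = pd.DataFrame(np.random.rand(5, 7))   # stand-in for the probability table
p = p.div(p.sum(axis=1), axis=0)         # normalize each row to sum to 1

# argsort of the negated values: column k of `order` holds the column index
# of the (k+1)-th largest probability in each row.
order = np.argsort(-p.to_numpy(), axis=1)
for k in range(p.shape[1]):
    p[f'{k + 1}-th'] = order[:, k]
print(p)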

iterrows() of 2 columns and save results in one column

In my data frame I want to iterate over two columns with iterrows() but save the result in one column. For example, df is:
  x    y
  5   10
 30  445
 70   32
The expected output is:
points  sequence
     5         1
    10         2
    30         1
   445         2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence by number of columns, convert the values to a numpy array with DataFrame.to_numpy, flatten with numpy.ravel, and build the sequence with numpy.tile:
df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile([1, 2], len(df))})
print(df)
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2
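For reference, a self-contained version of that snippet, with the question's sample frame rebuilt as code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [5, 30, 70], 'y': [10, 445, 32]})

out = pd.DataFrame({'points': df.to_numpy().ravel(),   # row-major flatten: 5, 10, 30, 445, 70, 32
                    'sequence': np.tile([1, 2], len(df))})
print(out)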
Another way is to rebuild the frame from its rows (note this only reproduces a frame that is already in the points/sequence shape):
>>> pd.DataFrame([row for _, row in df.iterrows()])
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2

Pandas Series shift with a condition returns "the truth value of a Series is ambiguous"

I have a pandas Series df containing 10 values (all doubles).
My aim is to create a new Series as follows:
newSerie = 1 if df > df.shift(1) else 0
In other words, newSerie outputs 1 if the current value of df is bigger than its previous value (and 0 otherwise).
However, I get:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
In addition, my aim is then to concatenate df and newSerie into a DataFrame, but newSerie only has 9 comparable values, since the first value of df cannot be compared with shift(1). Hence I need the first value of newSerie to be an empty value in order to be able to concatenate.
How can I do that? To give an example, imagine my input is only the Series df; the desired output is illustrated in the answer below.
You can use shift or diff:
# example dataframe:
data = pd.DataFrame({'df': [10, 9, 12, 13, 14, 15, 18, 16, 20, 1]})
print(data['df'])
0    10
1     9
2    12
3    13
4    14
5    15
6    18
7    16
8    20
9     1
Using Series.shift:
data['NewSerie'] = data['df'].gt(data['df'].shift()).astype(int)
Or Series.diff
data['NewSerie'] = data['df'].diff().gt(0).astype(int)
Output:
   df  NewSerie
0  10         0
1   9         0
2  12         1
3  13         1
4  14         1
5  15         1
6  18         1
7  16         0
8  20         1
9   1         0
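If the first entry should be empty (NaN) rather than 0, as the question asks, a minimal sketch (this assumes a float column is acceptable, since NaN cannot live in an int column):
import numpy as np
import pandas as pd

data = pd.DataFrame({'df': [10, 9, 12, 13, 14, 15, 18, 16, 20, 1]})

# Compare each value with its predecessor, then blank out the first entry,
# which has nothing to compare against.
data['NewSerie'] = data['df'].gt(data['df'].shift()).astype(float)
data.loc[data.index[0], 'NewSerie'] = np.nan
print(data)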

How to slice continuous and discontinuous index in pandas?

pandas iloc can slice a dataframe in two ways, such as df.iloc[:, 2:5] and df.iloc[:, [6, 10]].
If I want to select columns 2:5 together with 6 and 10, how can I use iloc to slice df?
Use numpy.r_:
From the docs:
Translates slice objects to concatenation along the first axis.
This is a simple way to build up arrays quickly. There are two use cases.
If the index expression contains comma separated arrays, then stack them along their first axis.
If the index expression contains slice notation or scalars then create a 1-D array with a range indicated by the slice notation.
Demo:
In [16]: df = pd.DataFrame(np.random.rand(3, 12))
In [17]: df.iloc[:, np.r_[2:5, 6, 10]]
Out[17]:
2 3 4 6 10
0 0.760201 0.378125 0.707002 0.310077 0.375646
1 0.770165 0.269465 0.419979 0.218768 0.832087
2 0.253142 0.737015 0.652522 0.474779 0.094145
In [18]: df
Out[18]:
0 1 2 3 4 5 6 7 8 9 10 11
0 0.668062 0.581268 0.760201 0.378125 0.707002 0.249094 0.310077 0.336708 0.847258 0.705631 0.375646 0.830852
1 0.521096 0.798405 0.770165 0.269465 0.419979 0.455890 0.218768 0.833776 0.862483 0.817974 0.832087 0.958174
2 0.211815 0.747482 0.253142 0.737015 0.652522 0.274231 0.474779 0.256119 0.110760 0.224096 0.094145 0.525201
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
So I updated my answer to fix that deprecated feature: changed .ix[...] to df.iloc[...].
I think you need numpy.r_ to concatenate the indices and then iloc for selecting by position:
ds = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3],
                   'G': [1, 3, 5],
                   'H': [5, 3, 6],
                   'I': [4, 4, 3],
                   'J': [6, 4, 3],
                   'K': [9, 4, 3]})
print (ds)
A B C D E F G H I J K
0 1 4 7 1 5 7 1 5 4 6 9
1 2 5 8 3 3 4 3 3 4 4 4
2 3 6 9 5 6 3 5 6 3 3 3
print (np.r_[2:5, 6,10])
[ 2 3 4 6 10]
print (ds.iloc[:, np.r_[2:5, 6,10]])
C D E G K
0 7 1 5 1 9
1 8 3 3 3 4
2 9 5 6 5 3
On the discussion of ix vs iloc: the main problem is that ix will be deprecated in Pandas 0.20.0, and it seems the new version is coming soon (in April), so it is better to use iloc.