Pandas .count() puts in first row name out of nowhere? - pandas

I have a pandas dataframe, where the first row is called school and the last row is called passed, and it has only numbers 1 and 0.
I simply wanted to count how often 1 or 0 occurs in that row.
I went with:
n_passed = df[df.passed==1].count()
The funny thing is, it gives me the correct number, but it also outputs 'school', for a reason that is beyond me.
school 265
Can anyone shed light on this?

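IIUC, you mean columns rather than rows: passed and school. df[df.passed==1] keeps every column of the matching rows, so .count() returns one non-null count per column, and school is simply the first of them. To count within a single column, use value_counts on passed:
print(df)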
school aa bb passed
0 1 0 1 1
1 0 1 0 0
2 1 1 0 1
3 0 0 1 1
n_passed1 = df.passed[df.passed==1].value_counts()
print(n_passed1)
1 3
Name: passed, dtype: int64
n_passed0 = df.passed[df.passed==0].value_counts()
print(n_passed0)
0 1
Name: passed, dtype: int64
But I think the best option is:
n_passed1 = df.passed.value_counts()
print(n_passed1)
1 3
0 1
Name: passed, dtype: int64
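To make the difference concrete, here is a minimal sketch built on the sample frame from this answer. It assumes the goal is a single number per value; the variable names per_column, n_passed and n_failed are only illustrative.

```python
import pandas as pd

# Sample frame copied from the printed output above
df = pd.DataFrame({
    "school": [1, 0, 1, 0],
    "aa": [0, 1, 1, 0],
    "bb": [1, 0, 0, 1],
    "passed": [1, 0, 1, 1],
})

# Filtering rows and then calling .count() on the whole frame returns
# one non-null count per column, labelled with the column names.
per_column = df[df.passed == 1].count()
print(per_column)

# Comparing a single column and summing the boolean mask gives a scalar.
n_passed = int((df.passed == 1).sum())
n_failed = int((df.passed == 0).sum())
print(n_passed, n_failed)
```

The first print shows the same count repeated for school, aa, bb and passed, which is where the unexpected 'school' label comes from.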

Related

incompatible index of inserted column with frame index with group by and count

I have data that looks like this:
CHROM POS REF ALT ... is_sever_int is_sever_str is_sever_f encoding_str
0 chr1 14907 A G ... 1 1 one one
1 chr1 14930 A G ... 1 1 one one
These are the columns I want to perform calculations on (example):
is_severe snp_id encoding
1 1 one
1 1 two
0 1 one
1 2 two
0 2 two
0 2 one
What I want to do is count, for each snp_id and is_severe, how many ones and twos are in the encoding column:
snp_id is_severe encoding_one encoding_two
1 1 1 1
1 0 1 0
2 1 0 1
2 0 1 1
I tried this:
df.groupby(["snp_id","is_sever_f","encoding_str"])["encoding_str"].count()
but it gave the error:
incompatible index of inserted column with frame index
Then I tried this:
df["count"]=df.groupby(["snp_id","is_sever_f","encoding_str"],as_index=False)["encoding_str"].count()
and it returned:
Expected a 1D array, got an array with shape (2532831, 3)
How can I fix this? Thank you :)
Group by all three columns, take the size of each group, then unstack the encoding level:
out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
         .unstack(fill_value=0)
         .add_prefix('encoding_')
         .reset_index())
print(out)
encoding is_severe snp_id encoding_one encoding_two
0 0 1 1 0
1 0 2 1 1
2 1 1 1 1
3 1 2 0 1
Try as follows:
Use pd.get_dummies to convert categorical data in column encoding into indicator variables.
Chain df.groupby and get sum to turn double rows per group into one row (i.e. [0,1] and [1,0] will become [1,1] where df.snp_id == 2 and df.is_severe == 0).
res = pd.get_dummies(data=df, columns=['encoding'])\
.groupby(['snp_id','is_severe'], as_index=False, sort=False).sum()
print(res)
snp_id is_severe encoding_one encoding_two
0 1 1 1 1
1 1 0 1 0
2 2 1 0 1
3 2 0 1 1
If your actual df has more columns, limit the data passed to get_dummies to the relevant ones, i.e. use:
res = (pd.get_dummies(data=df[['is_severe', 'snp_id', 'encoding']], columns=['encoding'])
       .groupby(['snp_id', 'is_severe'], as_index=False, sort=False)
       .sum())
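As a further option that neither answer uses, pd.crosstab can probably produce the same table in one step. This is only a sketch on the six example rows from the question, using the simplified column names is_severe, snp_id and encoding; on the real data the columns would be the longer names shown in the first printout.

```python
import pandas as pd

# The six example rows from the question
df = pd.DataFrame({
    "is_severe": [1, 1, 0, 1, 0, 0],
    "snp_id": [1, 1, 1, 2, 2, 2],
    "encoding": ["one", "two", "one", "two", "two", "one"],
})

# Rows are the (snp_id, is_severe) pairs, columns are the encoding values,
# and each cell holds the number of matching rows.
counts = (
    pd.crosstab([df["snp_id"], df["is_severe"]], df["encoding"])
    .add_prefix("encoding_")
    .reset_index()
)
counts.columns.name = None  # drop the leftover 'encoding' axis label
print(counts)
```

Unlike the attempts in the question, nothing is assigned back to df here: the result has one row per group, so its index cannot line up with the original frame.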

Adding new column as a sum of the subsquent columns [duplicate]

I have this df:
id car truck bus bike
0 1 1 0 0
1 0 0 1 0
2 1 1 1 1
I want to add another column count to this df but after id and before car to sum the values of the rows, like this:
id count car truck bus bike
0 2 1 1 0 0
1 1 0 0 1 0
2 4 1 1 1 1
I know how to add the column using this code:
df.loc[:,'count'] = df.sum(numeric_only=True, axis=1)
but the above code adds the new column in the last position.
How can I fix this?
There are several ways. Here are two.
#1. Reordering the columns after creating the count column:
df['count'] = df.drop(columns='id').sum(axis=1)  # sum the vehicle columns only, not id
df = df[['id', 'count', 'car', 'truck', 'bus', 'bike']]  # select columns in the desired order
print(df)
# id count car truck bus bike
#0 0 2 1 1 0 0
#1 1 1 0 0 1 0
#2 2 4 1 1 1 1
#2. Inserting a Series at a specific position with the insert function:
df.insert(1, "count", df.drop(columns='id').sum(axis=1))  # exclude id from the sum
print(df)
# id count car truck bus bike
#0 0 2 1 1 0 0
#1 1 1 0 0 1 0
#2 2 4 1 1 1 1
Try this slight modification of your code:
import pandas as pd
df = pd.DataFrame(data={'id':[0,1,2],'car':[1,0,1],'truck':[1,0,1],'bus':[0,1,1],'bike':[0,0,1]})
count = df.drop(columns=['id']).sum(numeric_only=True, axis=1)  # leave id out of the row sums
df.insert(1, "count", count)
print(df)

Pandas Group Columns by Value of 1 and Sort By Frequency

I have to take this dataframe:
d = {'Apple': [0,0,1,0,1,0], 'Aurora': [0,0,0,0,0,1], 'Barn': [0,1,1,0,0,0]}
df = pd.DataFrame(data=d)
Apple Aurora Barn
0 0 0 0
1 0 0 1
2 1 0 1
3 0 0 0
4 1 0 0
5 0 1 0
And count the frequency of the number one in each column, and create a new dataframe that looks like this:
df = pd.DataFrame([['Apple',0.3333], ['Aurora',0.166666], ['Barn', 0.3333]], columns = ['index', 'value'])
index value
0 Apple 0.333300
1 Aurora 0.166666
2 Barn 0.333300
I have tried this:
df['freq'] = df.groupby(1)[1].transform('count')
But I get an error: KeyError: 1
So I'm not sure how to count the value 1 across rows and columns, and group by column names and the frequency of 1 in each column.
If I understand correctly, you could simply take the column means, since the mean of a 0/1 column is the share of ones:
freq = df.mean()
Output:
>>> freq
Apple 0.333333
Aurora 0.166667
Barn 0.333333
dtype: float64
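If the two-column frame from the question is needed, sorted by frequency as the title suggests, the Series can be reshaped. This is a sketch under the assumption that the target column names are index and value as in the question; df.eq(1) is used so that only the value 1 is counted even if other values appear.

```python
import pandas as pd

d = {'Apple': [0, 0, 1, 0, 1, 0], 'Aurora': [0, 0, 0, 0, 0, 1], 'Barn': [0, 1, 1, 0, 0, 0]}
df = pd.DataFrame(data=d)

freq = (
    df.eq(1).mean()                 # share of rows equal to 1, per column
    .rename("value")                # name of the resulting value column
    .sort_values(ascending=False)   # most frequent first
    .reset_index()                  # column names move into a column called 'index'
)
print(freq)
```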

Dataframe apply set is not removing duplicate values

My dataset can sometimes include duplicates in one concatenated column like this:
Total
0 Thriller,Satire,Thriller
1 Horror,Thriller,Horror
2 Mystery,Horror,Mystery
3 Adventure,Horror,Horror
When doing this
df['Total'].str.split(",").apply(set)
I get
Total
0 {Thriller,Satire}
1 {Horror,Thriller}
2 {Mystery,Horror,Crime}
3 {Adventure,Horror}
And after encoding it with
df['Total'].str.get_dummies(sep=",")
I get a header looking like this
{'Horror {'Mystery {'Thriller ... Horror Thriller'}
Instead of
Horror Mystery Thriller
How do I get rid of the curly brackets when using Pandas dataframe?
Series.str.get_dummies also works fine with duplicates.
So omit the code that builds the unique values:
df['Total'] = df['Total'].str.split(",").apply(set)
And use only:
df1 = df['Total'].str.get_dummies(sep=",")
print (df1)
Adventure Horror Mystery Satire Thriller
0 0 0 0 1 1
1 0 1 0 0 1
2 0 1 1 0 0
3 1 1 0 0 0
But if you need to remove the duplicates first, add Series.str.join:
df1 = df['Total'].str.split(",").apply(set).str.join(',').str.get_dummies(sep=",")
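The claim that the duplicates do not matter can be checked with a small sketch on the four sample rows. The deduplication step below uses a plain Python join over a sorted set rather than Series.str.join, which is my own substitution to keep the intermediate strings deterministic.

```python
import pandas as pd

df = pd.DataFrame({"Total": [
    "Thriller,Satire,Thriller",
    "Horror,Thriller,Horror",
    "Mystery,Horror,Mystery",
    "Adventure,Horror,Horror",
]})

# Encode the raw strings directly; repeated genres still yield a single 1.
direct = df["Total"].str.get_dummies(sep=",")

# Deduplicate first, turn each set back into a comma-separated string,
# then encode. The column is never left holding set objects.
deduped = (
    df["Total"].str.split(",")
    .apply(lambda tags: ",".join(sorted(set(tags))))
    .str.get_dummies(sep=",")
)

print(direct.equals(deduped))
```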

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the index of the maximum value in a dataframe (df). My problem is that I have a df with several columns (more than 10), and one column holds identifiers that repeat. I need to extract, for each identifier, the row with the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
If the OP wants to write these maxima back to the frame, just create a transform and assign it.
In [17]: df['max'] = df.groupby('id')['value'].transform('max')
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
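Since the question mentions .idxmax and a frame with more than ten columns, here is a sketch of how whole rows might be pulled out instead of just the maxima. It assumes a default integer index and uses only the two columns shown; top_rows and all_top are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"id": list("abbccc"), "value": [0, 1, 1, 0, 2, 1]})

# idxmax gives, per group, the index label of the first row holding the
# maximum; .loc then returns those rows with all of their columns.
top_rows = df.loc[df.groupby("id")["value"].idxmax()]
print(top_rows)

# To keep every row tied for the maximum, compare each value against
# its group maximum instead.
all_top = df[df["value"] == df.groupby("id")["value"].transform("max")]
print(all_top)
```

The first variant returns exactly one row per id, while the second keeps both b rows because they share the same maximum.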