Count number of each unique value in pandas column [duplicate]

I have a dataframe like this:
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8,9,10,11,12])
df['group'] = [1,1,1,1,1,1,2,2,2,2,2,2]
df['Sex'] = ['male','female','male','male','male','female','male','male','male','female','female','female']
df
    group     Sex
1       1    male
2       1  female
3       1    male
4       1    male
5       1    male
6       1  female
7       2    male
8       2    male
9       2    male
10      2  female
11      2  female
12      2  female
Each group has 6 people in it; some are male, some are female. I want to get a dataframe which counts, for every group in group, the number of males and the number of females.
For example:
group 1 --> 4 male, 2 female
group 2 --> 3 male, 3 female
The details of how the result is presented are not important to me.
I have tried to use groupby, but there is no aggregation function (count, sum, mean, nunique, ...) which tells me the ratio between male and female.
Hope you can help me!

Use crosstab:
pd.crosstab(df['group'], df['Sex'])
Sex    female  male
group
1           2     4
2           3     3
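Since you mentioned wanting the ratio between male and female: crosstab can also return row-wise proportions directly via its normalize parameter (a small extra, not part of the original answer):
pd.crosstab(df['group'], df['Sex'], normalize='index')
Sex      female      male
group
1      0.333333  0.666667
2      0.500000  0.500000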

Use the groupby(), value_counts() and unstack() methods:
result = df.groupby('group')['Sex'].value_counts().unstack()
Now if you print result you will get:
Sex    female  male
group
1           2     4
2           3     3
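One caveat worth noting here (an addition, not from the original answer): if some group contains only one sex, unstack() leaves NaN for the missing category. Passing fill_value=0 keeps the counts as integers:
result = df.groupby('group')['Sex'].value_counts().unstack(fill_value=0)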

Related

Pandas replace function specifying the column [duplicate]

dataset = pd.read_csv('./file.csv')
dataset.head()
This gives:
   age     sex smoker married  region   price
0   39  female    yes      no      us  250000
1   28    male     no      no      us  400000
2   23    male     no     yes  europe  389000
3   17    male     no      no    asia  230000
4   43    male     no     yes    asia  243800
I want to replace all yes/no values of smoker with 0 or 1, but I don't want to change the yes/no values of married. I want to use pandas' replace function.
I did the following, but this obviously changes all yes/no values (from smoker and married column):
dataset = dataset.replace(to_replace='yes', value='1')
dataset = dataset.replace(to_replace='no', value='0')
   age     sex smoker married  region   price
0   39  female      1       0      us  250000
1   28    male      0       0      us  400000
2   23    male      0       1  europe  389000
3   17    male      0       0    asia  230000
4   43    male      0       1    asia  243800
How can I ensure that only the yes/no values from the smoker column get changed, preferably using Pandas' replace function?
Did you try:
dataset['smoker'] = dataset['smoker'].replace({'yes': 1, 'no': 0})
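For completeness, Series.map is a common alternative here; anything not present in the mapping becomes NaN, which makes typos easy to spot (a sketch, not part of the original answer):
dataset['smoker'] = dataset['smoker'].map({'yes': 1, 'no': 0})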

Multilevel Indexing with Groupby

Being new to Python, I'm struggling to apply other questions about the groupby function to my data. A sample of the data frame:
ID  Condition   Race  Gender  Income
 1          1  White    Male       1
 2          2  Black  Female       2
 3          3  Black    Male       5
 4          4  White  Female       3
...
I am trying to use the groupby function to get a count of how many blacks/whites, males/females, and income levels (12 of them) there are in each of the four conditions. Each of the columns, including income, is a string (i.e., categorical).
I'd like to get something such as
Condition   Race  Gender  Income  Count
        1  White    Male       1     19
        1  White  Female       1     17
        1  Black    Male       1     22
        1  Black  Female       1     24
        1  White    Male       2     12
        1  White  Female       2     15
        1  Black    Male       2     17
        1  Black  Female       2     19
...
Everything I've tried has come back very wrong, so I don't think I'm anywhere near right, but I've been using variations of
Data.groupby(['Condition','Gender','Race','Income'])['ID'].count()
When I run the above line I just get a 2-column matrix with an indecipherable index (e.g., f2df9ecc...) and the second column is labeled ID with what appear to be count numbers. Any help is appreciated.
If you investigate the resulting dataframe you will see that the grouping columns are inside the index, so just reset the index:
df = Data.groupby(['Condition','Gender','Race','Income'])['ID'].count().reset_index()
That was mainly to demonstrate; since you know what you want, you can specify the as_index argument instead:
df = Data.groupby(['Condition','Gender','Race','Income'], as_index=False)['ID'].count()
Also, since you want the last column to be 'count':
df = df.rename(columns={'ID': 'count'})
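For reference, a functionally equivalent one-liner is groupby(...).size(), which counts rows per combination and lets reset_index name the new column directly (a sketch, assuming the same column names as above):
df = Data.groupby(['Condition','Gender','Race','Income']).size().reset_index(name='count')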

Pandas: Count rows in table based on two columns, using one columns' value

I currently have the following dataframe:
    SN  Gender  Purchase
Name 1  Female      1.14
Name 2  Female      2.50
Name 3    Male      7.77
Name 1  Female      2.74
Name 3    Male      4.58
Name 3    Male      9.99
Name 1  Female      5.55
Name 2  Female      1.20
I am trying to figure out how to get just a count, not a dataframe, from a table like this. The count must be based on gender (so, how many males are there?), but must be unique by name (SN). So, in this instance, I would have 1 male and 2 females. I have tried multiple ways... value_counts on the dataframe, unique on the dataframe, etc., but I keep getting syntax errors.
There are a few ways you can achieve this.
The simplest one would be to use pd.crosstab to get a cross tabulation (count) of the values:
pd.crosstab(df["SN"], df["Gender"])
Gender  Female  Male
SN
Name 1       3     0
Name 2       2     0
Name 3       0     3
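If you also want row and column totals, crosstab accepts margins=True (a small extra, not part of the original answer):
pd.crosstab(df["SN"], df["Gender"], margins=True)
Gender  Female  Male  All
SN
Name 1       3     0    3
Name 2       2     0    2
Name 3       0     3    3
All          5     3    8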
Another way is to use DataFrame.value_counts(), which came about in pandas version >= 1.1.0. Instead of a cross tabulation, this returns a Series whose values are the counts of data per unique index combination. The index is a MultiIndex referring to unique combinations of "SN" and "Gender".
df.value_counts(["SN", "Gender"])
SN      Gender
Name 3  Male      3
Name 1  Female    3
Name 2  Female    2
dtype: int64
If you're operating with a pandas version older than 1.1.0, you can use a combination of groupby and value_counts. This performs a functionally equivalent operation to DataFrame.value_counts, so we get the same output:
df.groupby("SN")["Gender"].value_counts()
SN      Gender
Name 1  Female    3
Name 2  Female    2
Name 3  Male      3
Name: Gender, dtype: int64
Edit: If you want to only count the number of unique "SN" for each gender, you can use nunique() instead of value_counts:
unique_genders = df.groupby(["Gender"])["SN"].nunique()
print(unique_genders)
Gender
Female    2
Male      1
Name: SN, dtype: int64
Then you can extract each:
>>> unique_genders["Female"]
2
>>> unique_genders["Male"]
1
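Another route to the same unique counts, if you prefer to deduplicate first (a sketch, using the same df as above):
df.drop_duplicates(["SN", "Gender"])["Gender"].value_counts()
Female    2
Male      1
Name: Gender, dtype: int64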

group by aggregate function for multiplication

I want to aggregate 3 dataframes I have, but instead of adding them together I want to multiply them. Is there a way to do it?
i.e.
df=result.groupby(['name']).agg({'A':'sum','B':'sum'})
df1
      A  B
tim   1  5
emma  3  7
df2
      A  B
tim   1  8
emma  1  2
result
      A   B
tim   2  13
emma  4   9
Instead of summing the two, I want to multiply them:
       A   B
tim    1  40
emma  12  18
Use GroupBy.prod:
df=result.groupby(['name']).agg({'A':'prod','B':'prod'})
If you also need to join them together first:
df = pd.concat([df1, df2]).groupby('name', as_index=False).prod()
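A minimal end-to-end sketch of the concat approach with the two sample frames above (assuming the names form the index, as printed; note that with only these two frames, emma's products come out as 3 and 14 rather than the 12 and 18 shown, which presumably involve the third dataframe):
import pandas as pd
df1 = pd.DataFrame({'A': [1, 3], 'B': [5, 7]}, index=['tim', 'emma'])
df2 = pd.DataFrame({'A': [1, 1], 'B': [8, 2]}, index=['tim', 'emma'])
# stack the rows of both frames, then take the product within each name
result = pd.concat([df1, df2]).groupby(level=0).prod()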

Pandas Dataframe and duplicate names [duplicate]

I have a Pandas dataframe with some numerical data about people.
What I need to do is find the people that appear more than once in the dataframe, and replace all the rows for one person with a single row whose numeric values are the sums of the numeric values of the original rows.
Example:
Names  Column1  Column2
John         1        2
Bob          2        3
Pier         1        1
John         3        3
Bob          1        0
It has to become:
Names  Column1  Column2
John         4        5
Bob          3        3
Pier         1        1
How can I do this?
Try this:
In [975]: df.groupby('Names')[['Column1','Column2']].sum()
Out[975]:
       Column1  Column2
Names
Bob          3        3
John         4        5
Pier         1        1
groupby and sum should do the job:
df.groupby('Names').sum().sort_values('Column1', ascending=False)
       Column1  Column2
Names
John         4        5
Bob          3        3
Pier         1        1
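If you would rather keep Names as a regular column instead of the index, as_index=False does that in one step (a small variation on the answers above, not from the original):
df.groupby('Names', as_index=False)[['Column1','Column2']].sum()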