Being new to Python, I'm struggling to apply the answers to other questions about the groupby function to my data. A sample of the data frame:
ID Condition Race Gender Income
1 1 White Male 1
2 2 Black Female 2
3 3 Black Male 5
4 4 White Female 3
...
I am trying to use the groupby function to get a count of how many Black/White, male/female, and income (12 levels) rows there are in each of the four conditions. Every column, including Income, holds strings (i.e., is categorical).
I'd like to get something such as
Condition Race Gender Income Count
1 White Male 1 19
1 White Female 1 17
1 Black Male 1 22
1 Black Female 1 24
1 White Male 2 12
1 White Female 2 15
1 Black Male 2 17
1 Black Female 2 19
...
Everything I've tried has come back very wrong, so I don't think I'm anywhere near right, but I've been using variations of
Data.groupby(['Condition','Gender','Race','Income'])['ID'].count()
When I run the above line I just get a 2 column matrix with an indecipherable index (e.g., f2df9ecc...) and the second column is labeled ID with what appear to be count numbers. Any help is appreciated.
If you inspect the result, you will see that the grouping columns have become the index, so just reset the index:
df = Data.groupby(['Condition','Gender','Race','Income'])['ID'].count().reset_index()
That was mainly to demonstrate what is happening. Since you know what you want, you can instead specify the as_index argument as follows:
df = Data.groupby(['Condition','Gender','Race','Income'],as_index=False)['ID'].count()
Also, since you want the last column to be named 'count':
df = df.rename(columns={'ID':'count'})
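As a variation on the steps above, here is a minimal sketch that counts rows with size() and names the count column in one go. The sample rows are made up for illustration (they are not the asker's data), and I'm assuming every column holds plain strings as described in the question.

```python
import pandas as pd

# Hypothetical sample rows that mirror the question's columns
data = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Condition": ["1", "1", "1", "2", "2", "2"],
    "Race": ["White", "White", "Black", "Black", "Black", "White"],
    "Gender": ["Male", "Male", "Female", "Male", "Male", "Female"],
    "Income": ["1", "1", "2", "5", "5", "3"],
})

# size() counts the rows in each group; reset_index(name=...) turns the
# group keys back into columns and names the count column
counts = (
    data.groupby(["Condition", "Race", "Gender", "Income"])
        .size()
        .reset_index(name="Count")
)
print(counts)
```

Only combinations that actually occur in the data get a row here. size() also does not depend on which column you select, so you don't need ['ID'] at all.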
I have a dataframe like this:
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8,9,10,11,12])
df['group'] = [1,1,1,1,1,1,2,2,2,2,2,2]
df['Sex'] = ['male', 'female','male', 'male','male', 'female','male', 'male','male', 'female','female', 'female',]
df
group Sex
1 1 male
2 1 female
3 1 male
4 1 male
5 1 male
6 1 female
7 2 male
8 2 male
9 2 male
10 2 female
11 2 female
12 2 female
Each group has 6 people in it. Some are male, some are female. I want to get a dataframe which counts for every group in group the number of males and the number of females.
For example:
group 1 --> 4 male, 2 female
group 2 --> 3 male, 3 female
The details of how the result is presented are not important to me.
I have tried to use groupby, but there is no function (count, sum, mean, nunique...) which tells me the ratio between male and female.
Hope you can help me!
Use crosstab:
pd.crosstab(df['group'], df['Sex'])
Sex female male
group
1 2 4
2 3 3
Use the groupby(), value_counts() and unstack() methods:
result=df.groupby('group')['Sex'].value_counts().unstack()
Now if you print result you will get:
Sex female male
group
1 2 4
2 3 3
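Since the question also mentions the ratio between male and female, here is a small sketch that gets both the counts and the per-group shares from crosstab. It rebuilds the question's dataframe (with a default index instead of 1..12, which doesn't affect the counts) and uses the normalize="index" option so each row sums to 1.

```python
import pandas as pd

# Same group/Sex values as in the question
df = pd.DataFrame({
    "group": [1] * 6 + [2] * 6,
    "Sex": ["male", "female", "male", "male", "male", "female",
            "male", "male", "male", "female", "female", "female"],
})

# Raw counts per group and sex
counts = pd.crosstab(df["group"], df["Sex"])

# Share of each sex within a group (each row sums to 1)
shares = pd.crosstab(df["group"], df["Sex"], normalize="index")

print(counts)
print(shares)
```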
I'm trying to wrangle some data to show how many items a range of people have in common. The goal is to show this data in a heatmap format via Seaborn to understand these overlaps visually.
Here's some sample data:
demo_df = pd.DataFrame([
("Get Back", 1,0,2),
("Help", 5, 2, 0),
("Let It Be", 0,2,2)
],columns=["Song","John", "Paul", "Ringo"])
demo_df.set_index("Song")
John Paul Ringo
Song
Get Back 1 0 2
Help 5 2 0
Let It Be 0 2 2
I don't need a breakdown by song, just the total of shared items. The resulting data would show a sum of how many items they share like this:
Name   John  Paul  Ringo
John   -     7     3
Paul   7     -     4
Ringo  3     4     -
So far I've tried a few options with groupby and unstack but haven't been able to work out how to cross match the names into both column and header rows.
We can use dot products and then fill the diagonal (here df is assumed to be demo_df.set_index("Song")):
out = df.T.dot(df.ne(0)) + df.T.ne(0).dot(df)
np.fill_diagonal(out.values, 0)
out
Out[176]:
John Paul Ringo
John 0 7 3
Paul 7 0 4
Ringo 3 4 0
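For reference, here is a self-contained sketch of the same dot-product idea. The one deliberate difference is that it zeroes the diagonal on a NumPy copy rather than writing through out.values, because that in-place write can fail if pandas hands back a read-only array (as I understand it does with copy-on-write enabled). The names has_item and shared are mine.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(
    [("Get Back", 1, 0, 2), ("Help", 5, 2, 0), ("Let It Be", 0, 2, 2)],
    columns=["Song", "John", "Paul", "Ringo"],
)
df = demo_df.set_index("Song")

# 1 where a person has at least one item for the song, else 0
has_item = df.ne(0).astype(int)

# For each pair (a, b): a's items on songs that b also has, plus the reverse
shared = df.T.dot(has_item) + has_item.T.dot(df)

# Zero the diagonal on an explicit copy, then rebuild the labelled frame
values = shared.to_numpy().copy()
np.fill_diagonal(values, 0)
out = pd.DataFrame(values, index=shared.index, columns=shared.columns)
print(out)
```

The result is symmetric, so it can be passed straight to seaborn's heatmap if that is the end goal.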
I currently have the following dataframe:
SN Gender Purchase
Name 1 Female 1.14
Name 2 Female 2.50
Name 3 Male 7.77
Name 1 Female 2.74
Name 3 Male 4.58
Name 3 Male 9.99
Name 1 Female 5.55
Name 2 Female 1.20
I am trying to figure out how to just get a count, not a Dataframe, from a table like this. The count must be based on gender (so, how many males are there?), but must be unique by name (SN). So, in this instance, I would have 1 male and 2 females. I have tried multiple ways...valuecounts from the data frame, unique from the dataframe, etc. but I keep getting syntax errors.
There are a few ways you can achieve this.
The simplest one would be to use pd.crosstab to get a cross tabulation (count) of the values:
pd.crosstab(df["SN"], df["Gender"])
Gender Female Male
SN
Name 1 3 0
Name 2 2 0
Name 3 0 3
Another way is to use DataFrame.value_counts(), which was added in pandas 1.1.0. Instead of a cross tabulation, this returns a Series whose values are the counts for each unique index combination. The index is a MultiIndex of the unique combinations of "SN" and "Gender":
df.value_counts(["SN", "Gender"])
SN Gender
Name 3 Male 3
Name 1 Female 3
Name 2 Female 2
dtype: int64
If you're on a pandas version older than 1.1.0, you can use a combination of groupby and value_counts. This is functionally equivalent to DataFrame.value_counts, so we get the same counts (only the row order differs):
df.groupby("SN")["Gender"].value_counts()
SN Gender
Name 1 Female 3
Name 2 Female 2
Name 3 Male 3
Name: Gender, dtype: int64
Edit: If you want to only count the number of unique "SN" for each gender, you can use nunique() instead of value_counts:
unique_genders = df.groupby(["Gender"])["SN"].nunique()
print(unique_genders)
Gender
Female 2
Male 1
Name: SN, dtype: int64
Then you can extract each:
>>> unique_genders["Female"]
2
>>> unique_genders["Male"]
1
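If you really only want plain integers, another option is to drop the repeated names first and then count genders. This is just a sketch. It assumes every SN maps to exactly one Gender, and the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "SN": ["Name 1", "Name 2", "Name 3", "Name 1",
           "Name 3", "Name 3", "Name 1", "Name 2"],
    "Gender": ["Female", "Female", "Male", "Female",
               "Male", "Male", "Female", "Female"],
    "Purchase": [1.14, 2.50, 7.77, 2.74, 4.58, 9.99, 5.55, 1.20],
})

# Keep one row per person, then count genders among those rows
unique_people = df.drop_duplicates(subset="SN")
gender_counts = unique_people["Gender"].value_counts()

# .get returns a default of 0 if a gender does not appear at all
n_male = int(gender_counts.get("Male", 0))
n_female = int(gender_counts.get("Female", 0))
print(n_male, n_female)
```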
I have the following dataframe df, which specifies latitudes and longitudes for a certain groupnumber:
latitude longitude group
0 51.822231 4.700267 1
1 51.822617 4.801417 1
2 51.823235 4.903300 1
3 51.823433 5.003917 1
4 51.823616 5.504467 1
5 51.822231 3.900267 2
6 51.822617 3.901417 2
7 51.823235 3.903300 2
8 51.823433 6.903917 2
9 51.823616 8.904467 2
10 51.822231 1.900267 3
11 51.822617 2.901417 3
12 51.823235 11.903300 3
13 51.823433 12.903917 3
14 51.823616 13.904467 3
Within each group I am trying to find the lower and upper neighbours, in the 'longitude' column, of a specified value longitude_value = 5.00. The longitudes in df are sorted in ascending order within each group.
Per group I want the rows holding the lower and upper neighbours of longitude=5.000000. The desired output looks like:
latitude longitude group
2 51.823235 4.903300 1
3 51.823433 5.003917 1
7 51.823235 3.903300 2
8 51.823433 6.903917 2
11 51.822617 2.901417 3
12 51.823235 11.903300 3
From this result I want to rearrange the data a little bit as:
lat_lo lat_up lon_lo lon_up
0 51.823235 51.823433 4.903300 5.003917
1 51.823235 51.823433 3.903300 6.903917
2 51.822617 51.823235 2.901417 11.903300
Hope I got your question right. See my attempt below. Made it long to be explicit in my approach. I could have easily introduced a longitude value of 5.00 and sliced on index but that would have complicated answering part 2 of your question. If I missed something, let me know.
Data
df=pd.read_clipboard()
df
Input point and calculate difference with longitude
fn=5.00
df['dif']=(df['longitude']-fn)
df
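Find the smallest positive difference in each group (the upper neighbour). idxmin returns the row label, so latitude and longitude are taken from the same row
df1=df.loc[df[df['dif'] > 0].groupby('group')['dif'].idxmin()]
Find the negative difference closest to zero in each group (the lower neighbour)
df2=df.loc[df[df['dif'] < 0].groupby('group')['dif'].idxmax()]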
Concatenate the two results into one df (DataFrame.append was removed in pandas 2.0, so use pd.concat). This answers your question 1
df3=pd.concat([df1, df2], ignore_index=True).sort_values(['group','longitude'])
df3
Question 2
Introduce a column called Status and fill it with an alternating pattern, 3 for the lower neighbour and 4 for the upper neighbour. This relies on df3 being sorted by group and longitude, and on every group having both neighbours
df3['Status']=np.tile([3, 4], len(df3)//2)
df3=df3.drop(columns=['dif'])
df3
Rename the neighbours to lon_lo and lon_up
df3['Status']=np.where(df3['Status']==3,'lon_lo','lon_up')
Using pivot_table, split the dataframe into a lon_lo part with its latitude, and do the same for lon_up. The rationale here is to separate the latitudes into two groups, lo and up
first group break
df4=df3[df3['Status']=='lon_lo']
result=df4.pivot_table('longitude',['latitude','group'],'Status').reset_index().set_index('group')
second group break
df4=df3[df3['Status']=='lon_up']
result1=df4.pivot_table('longitude',['latitude','group'],'Status').reset_index().set_index('group')
Merge the two groups on the index, with the lower-neighbour frame on the left so that the suffixes rename the latitudes to latitude_lo and latitude_up the right way round
final=result.merge(result1, left_index=True, right_index=True, suffixes=('_lo','_up'))
final
Output
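The walkthrough above can also be sketched more compactly with idxmax/idxmin, which pick the whole neighbouring row per group. This is not the answer's exact method, just a shorter variant. The dataframe is rebuilt from the question's values, and the names below, above, lower, upper and neighbours are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "latitude": [51.822231, 51.822617, 51.823235, 51.823433, 51.823616] * 3,
    "longitude": [4.700267, 4.801417, 4.903300, 5.003917, 5.504467,
                  3.900267, 3.901417, 3.903300, 6.903917, 8.904467,
                  1.900267, 2.901417, 11.903300, 12.903917, 13.904467],
    "group": [1] * 5 + [2] * 5 + [3] * 5,
})
longitude_value = 5.00

below = df[df["longitude"] < longitude_value]
above = df[df["longitude"] > longitude_value]

# idxmax / idxmin give the row label of the closest longitude per group,
# so latitude and longitude always come from the same row
lower = df.loc[below.groupby("group")["longitude"].idxmax()].set_index("group")
upper = df.loc[above.groupby("group")["longitude"].idxmin()].set_index("group")

# Side-by-side layout; a group without an upper neighbour would get NaN
neighbours = lower.join(upper, lsuffix="_lo", rsuffix="_up")
neighbours = neighbours.rename(columns={
    "latitude_lo": "lat_lo", "latitude_up": "lat_up",
    "longitude_lo": "lon_lo", "longitude_up": "lon_up",
})[["lat_lo", "lat_up", "lon_lo", "lon_up"]]
print(neighbours)
```

Note that this does not require the longitudes to be sorted within each group.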
I am working with some family data which holds records on caregivers and the number of children that caregiver has. Currently, demographic information for the caregiver and all children that caregiver has is in the caregiver record. I want to take children's demographic information and put it into the children's respective record/row. Here is an example of the data I am working with:
Vis POS FAMID G1ID   G2ID   G1B  G2B1 G2B2 G2B3 G1R   G2R1  G2R2  G2R3
1   0   1     100011        1979 2010           White White
1   1   1            200011
1   0   2     100021        1969 2011 2009      AA    AA    White
1   1   2            200021
1   2   2            200022
1   0   3     100031        1966 2008 2010 2011 White White AA    AA
1   1   3            200031
1   2   3            200032
1   3   3            200033
G1 = caregiver data
G2 = child data
GxBx = birthyear
GxRx = race
OUTPUT
Visit POS FAMID G1     G2     G1Birth G2Birth G1Race G2Race
1     0   1     100011        1979            White
1     1   1            200011         2010           White
1     0   2     100021        1969            AA
1     1   2            200021         2011           AA
1     2   2            200022         2009           White
1     0   3     100031        1966            White
1     1   3            200031         2008           White
1     2   3            200032         2010           AA
1     3   3            200033         2011           AA
From these two tables you can see I want all G2Bx columns to fall into a new G2Birth column, and same principle for G2Rx columns. (I actually have several more instances like race and birthyear in my actual data)
I have been looking into the pivot and stack functions for pandas dataframes but I haven't quite got what I wanted. The closest I have gotten was using the melt function, but the issue I had with it was that I couldn't get it to map to indexes without taking all values from that column, i.e. it wants to create a row for child2 and child3 for people who only have child1. I might just be using the melt function incorrectly.
What I want is for all values from g2Birthdate1 to map onto the row where POS=1, all values from g2Birthdate2 onto the row where POS=2, etc. Is there a function which can help accomplish this? Or does this require some additional coding solution?
You can do this with a row and a column MultiIndex and a left join:
# df is your initial dataframe
# Make a baseline dataframe to hold the IDs
id_df = df.drop(columns=[c for c in df.columns if c not in ["G1ID", "G2ID","Vis","FAMID","POS"]])
# Make a rows MultiIndex to join on at the end
id_df = id_df.set_index(["Vis","FAMID","POS"])
# Keep only the demographic columns; they get hierarchical names below
data_df = df.drop(columns=[c for c in df.columns if c in ["G1ID", "G2ID", "POS"]])
# Make the first two parts of the MultiIndex required for the join at the end
data_df = data_df.set_index(["Vis","FAMID"])
# Make the columns also have a MultiIndex
data_df.columns = pd.MultiIndex.from_tuples([("G1Birth",0),("G2Birth",1),("G2Birth",2),("G2Birth",3),
("G1Race",0),("G2Race",1),("G2Race",2),("G2Race",3)])
# Name the columnar index levels
data_df.columns.names = (None, "POS")
# Stack the newly formed lower-level into the rows MultiIndex to complete it in prep for joining
data_df = data_df.stack("POS")
# Join to the id dataframe on the full MultiIndex
final = id_df.join(data_df)
final = final.reset_index()
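For comparison, here is a sketch of a different route that avoids MultiIndex columns: build one small frame per child slot from the caregiver rows, then left-merge everything back onto the ID rows by (Vis, FAMID, POS). It is not the method above, just an alternative. It only uses families 1 and 2 from the question, assumes the blanks are read in as NaN and that the caregiver row is always POS 0, and the names keys, caregivers, children, ids and g1 are mine.

```python
import numpy as np
import pandas as pd

# Families 1 and 2 from the question; blanks are assumed to be NaN
df = pd.DataFrame({
    "Vis":   [1, 1, 1, 1, 1],
    "POS":   [0, 1, 0, 1, 2],
    "FAMID": [1, 1, 2, 2, 2],
    "G1ID":  [100011, np.nan, 100021, np.nan, np.nan],
    "G2ID":  [np.nan, 200011, np.nan, 200021, 200022],
    "G1B":   [1979, np.nan, 1969, np.nan, np.nan],
    "G2B1":  [2010, np.nan, 2011, np.nan, np.nan],
    "G2B2":  [np.nan, np.nan, 2009, np.nan, np.nan],
    "G1R":   ["White", np.nan, "AA", np.nan, np.nan],
    "G2R1":  ["White", np.nan, "AA", np.nan, np.nan],
    "G2R2":  [np.nan, np.nan, "White", np.nan, np.nan],
})

keys = ["Vis", "FAMID"]
caregivers = df[df["POS"] == 0]

# One small frame per child slot, tagged with the POS value it belongs to
child_parts = []
for pos in (1, 2):  # extend to however many child slots the data has
    part = caregivers[keys].copy()
    part["POS"] = pos
    part["G2Birth"] = caregivers[f"G2B{pos}"]
    part["G2Race"] = caregivers[f"G2R{pos}"]
    child_parts.append(part)
children = pd.concat(child_parts, ignore_index=True)

# Caregiver demographics stay on the POS 0 row
g1 = caregivers[keys + ["POS", "G1B", "G1R"]].rename(
    columns={"G1B": "G1Birth", "G1R": "G1Race"}
)

# Left merges keep exactly the rows that exist in the ID table
ids = df[keys + ["POS", "G1ID", "G2ID"]]
final = (
    ids.merge(g1, on=keys + ["POS"], how="left")
       .merge(children, on=keys + ["POS"], how="left")
)
print(final)
```

Because the merges are left joins onto the existing ID rows, no rows are created for child2 or child3 in families that don't have them.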