Conditional Sum Pandas Dataframe

I am trying to aggregate and sum values from a Pandas Dataframe based on the values in the column "Gender". This is a sample of the dataset that I am working on:
import pandas as pd

df_genders = pd.DataFrame({'Country': ['US', 'US', 'US', 'US', 'US', 'India', 'India', 'India', 'UK', 'UK', 'UK', 'UK'],
                           'Gender': ['Man', 'Woman', 'Non-Binary,Genderqueer', 'Non-Binary', 'Non-Binary,Genderqueer,Non-Conforming',
                                      'Man', 'Woman', 'Non-Binary', 'Man', 'Woman', 'Non-Binary,Genderqueer', 'Non-Binary,Genderqueer,Non-Conforming'],
                           'Count': [7996, 915, 11, 34, 153, 3857, 287, 47, 2566, 272, 72, 99]})
df_genders
Since the values of Gender are not very consistent, I would like to group them together and sum their Counts, in order to obtain for each country the sum of Men, Women and Non-Binary (Non-Binary being neither "Man" nor "Woman").
I wasn't able to write code for conditional grouping and summing, so my approach was to find the totals per Country and then subtract the sum of Man + Woman from them, being thus left with the sum of Non-Binary:
df_genders.groupby('Country')['Count'].sum() - df_genders[(df_genders['Gender']=='Man') | (df_genders['Gender']=='Woman')].groupby('Country')['Count'].sum()
Do you know a better way to solve this or in general a way for performing conditional aggregations (group by and sum)?
Thank you!

You could do it directly:
res = df_genders[~df_genders['Gender'].isin(('Man', 'Woman'))]['Count'].sum()
print(res)
Output
416
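If you want the same figure broken down per country, as in your own expression, the filter combines naturally with a groupby (a small sketch based on the sample data above):
res = df_genders[~df_genders['Gender'].isin(('Man', 'Woman'))].groupby('Country')['Count'].sum()
print(res)
Output
Country
India     47
UK       171
US       198
Name: Count, dtype: int64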
But I think it is better if you create a new column with the classification you are seeking. For example, one approach:
df_genders['grouped-genders'] = df_genders['Gender'].map({'Man': 'Man', 'Woman': 'Woman'}).fillna('Non-Binary')
print(df_genders)
Output
Country Gender Count grouped-genders
0 US Man 7996 Man
1 US Woman 915 Woman
2 US Non-Binary,Genderqueer 11 Non-Binary
3 US Non-Binary 34 Non-Binary
4 US Non-Binary,Genderqueer,Non-Conforming 153 Non-Binary
5 India Man 3857 Man
6 India Woman 287 Woman
7 India Non-Binary 47 Non-Binary
8 UK Man 2566 Man
9 UK Woman 272 Woman
10 UK Non-Binary,Genderqueer 72 Non-Binary
11 UK Non-Binary,Genderqueer,Non-Conforming 99 Non-Binary
and then group by the new column to obtain the count of all the genders:
res = df_genders.groupby('grouped-genders')['Count'].sum().reset_index()
print(res)
Output
grouped-genders Count
0 Man 14419
1 Non-Binary 416
2 Woman 1474
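If you also need the per-country breakdown, the same new column works with a two-level groupby (a sketch following the approach above):
res = df_genders.groupby(['Country', 'grouped-genders'])['Count'].sum().reset_index()
print(res)
Output
  Country grouped-genders  Count
0   India             Man   3857
1   India      Non-Binary     47
2   India           Woman    287
3      UK             Man   2566
4      UK      Non-Binary    171
5      UK           Woman    272
6      US             Man   7996
7      US      Non-Binary    198
8      US           Woman    915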

How can I compute a rolling sum using groupby in pandas?

I'm working on a fun side project and would like to compute a moving sum of the number of wins for NBA teams over 2-year periods. Consider the sample pandas dataframe below:
import pandas as pd

df = pd.DataFrame({'Team': ['Hawks', 'Hawks', 'Hawks', 'Hawks', 'Hawks'],
                   'Season': [1970, 1971, 1972, 1973, 1974],
                   'Wins': [40, 34, 30, 46, 42]})
I would ideally like to compute the sum of the number of wins between 1970 and 1971, 1971 and 1972, 1972 and 1973, etc. An inefficient way would be to use a loop; is there a way to do this using the .groupby function?
This is a little bit of a hack, but you could group by df['Season'] // 2 * 2, i.e. floor-divide each year by two and multiply by two again, which rounds each year down to a multiple of two. Note that this produces non-overlapping two-year bins rather than a true moving sum:
df_sum = df.groupby(['Team', df['Season'] // 2 * 2])['Wins'].sum().reset_index()
Output:
Team Season Wins
0 Hawks 1970 74
1 Hawks 1972 76
2 Hawks 1974 42
If the years are ordered within each team, you can just use rolling combined with groupby. For example:
import pandas as pd
df = pd.DataFrame({'Team':['Hawks','Hawks','Hawks','Hawks','Hawks'], 'Season':[1970,1971,1972,1973,1974],'Wins':[40,34,30,46,42]})
res = df.groupby('Team')['Wins'].rolling(2).sum()
print(res)
Out:
Team
Hawks  0     NaN
       1    74.0
       2    64.0
       3    76.0
       4    88.0
Name: Wins, dtype: float64
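If you want the rolling sum back as a column of the original frame, one possible sketch (reset_index(level=0, drop=True) drops the Team level so the result realigns with df's index; Wins_2yr is just an illustrative name):
df['Wins_2yr'] = df.groupby('Team')['Wins'].rolling(2).sum().reset_index(level=0, drop=True)
print(df)
which for the sample data gives:
    Team  Season  Wins  Wins_2yr
0  Hawks    1970    40       NaN
1  Hawks    1971    34      74.0
2  Hawks    1972    30      64.0
3  Hawks    1973    46      76.0
4  Hawks    1974    42      88.0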

How do you copy data from one dataframe to another?

I am having a difficult time getting the correct data from a reference csv file into the one I am working on.
I have a csv file that has over 6 million rows and 19 columns.
For each row there is a brand and a model of a car amongst other information.
I want to add to this file the fuel consumption per 100km traveled and the type of fuel that is used.
I have another csv file that has the fuel consumption of every model of car.
What I ultimately want to do is add the matching values of the G, H, I and J columns from the second file to the first one.
Because of the size of the file I was wondering if there is another way to do it other than with a "for" or a "while" loop?
EDIT :
For example...
The first df would look something like this:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              NaN           NaN
2   Honda   Civic   b              NaN           NaN
3   GMC     Sierra  c              NaN           NaN
4   Toyota  Rav4    d              NaN           NaN
The second df would be something like this:
ID  Brand   Model    Fuel_consu_1  Fuel_consu_2
1   Toyota  Corrola  100           120
2   Toyota  Rav4     80            84
3   GMC     Sierra   91            105
4   Honda   Civic    112           125
The output should be:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              80            84
2   Honda   Civic   b              112           125
3   GMC     Sierra  c              91            105
4   Toyota  Rav4    d              80            84
The first df may have the same brand and model many times under different IDs. The order is completely random.
Thank you for providing updates. I was able to put something together that should be able to help you:
# Drop the placeholder fuel columns from the first table; the real values
# will come from the second table (df1) after the merge
df.drop(['Fuel_consu_1', 'Fuel_consu_2'], axis=1, inplace=True)
# Join the first and second tables on the Brand and Model columns
df_merge = pd.merge(df, df1, on=['Brand', 'Model'])
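For completeness, here is a minimal, self-contained sketch of that merge using the sample tables from the question (how='left' is an assumption: it keeps rows of the big file that have no match in the reference file, with NaN in the fuel columns):
import pandas as pd

# First table: car records with placeholder fuel columns
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Brand': ['Toyota', 'Honda', 'GMC', 'Toyota'],
                   'Model': ['Rav4', 'Civic', 'Sierra', 'Rav4'],
                   'Other_columns': ['a', 'b', 'c', 'd']})

# Second table: fuel consumption per brand and model
df1 = pd.DataFrame({'Brand': ['Toyota', 'Toyota', 'GMC', 'Honda'],
                    'Model': ['Corrola', 'Rav4', 'Sierra', 'Civic'],
                    'Fuel_consu_1': [100, 80, 91, 112],
                    'Fuel_consu_2': [120, 84, 105, 125]})

# Left merge: every row of df is kept, matched on Brand and Model
df_merge = pd.merge(df, df1, on=['Brand', 'Model'], how='left')
print(df_merge)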

using groupby for datetime values in pandas

I'm using this code in order to group my data by year:
df = pd.read_csv('../input/companies-info-wikipedia-2021/sparql_2021-11-03_22-25-45Z.csv')
df_duplicate_name = df[df.duplicated(['name'])]
df = df.drop_duplicates(subset='name').reset_index()
df = df.drop(['a','type','index'],axis=1).reset_index()
df = df[~df['foundation'].str.contains('[A-Za-z]', na=False)]
df = df.drop([140,214,220])
df['foundation'] = df['foundation'].fillna(0)
df['foundation'] = pd.to_datetime(df['foundation'])
df['foundation'] = df['foundation'].dt.year
df = df.groupby('foundation')
But as a result it does not group it by foundation values:
0 0 Deutsche EuroShop AG 1999 http://dbpedia.org/resource/Germany Investment in shopping centers http://dbpedia.org/resource/Real_property 4 2.964E9 1.25E9 2.241E8 8.04E7
1 1 Industry of Machinery and Tractors 1996 http://dbpedia.org/resource/Belgrade http://dbpedia.org/resource/Tractors http://dbpedia.org/resource/Agribusiness 4 4.648E7 0.0 30000.0 -€0.47 million
2 2 TelexFree Inc. 2012 http://dbpedia.org/resource/Massachusetts 99 http://dbpedia.org/resource/Multi-level_marketing 7 did not disclose did not disclose did not disclose did not disclose
3 3 (prev. Common Cents Communications Inc.) 2012 http://dbpedia.org/resource/United_States 99 http://dbpedia.org/resource/Multi-level_marketing 7 did not disclose did not disclose did not disclose did not disclose
4 4 Bionor Holding AS 1993 http://dbpedia.org/resource/Oslo http://dbpedia.org/resource/Health_care http://dbpedia.org/resource/Biotechnology 18 NOK 253 395 million NOK 203 320 million 1.09499E8 NOK 49 020 million
... ... ... ... ... ... ... ... ... ... ... ...
255 255 Ageas SA/NV 1990 http://dbpedia.org/resource/Belgium http://dbpedia.org/resource/Insurance http://dbpedia.org/resource/Financial_services 45000 1.0872E11 1.348E10 1.112E10 9.792E8
256 256 Sharp Corporation 1912 http://dbpedia.org/resource/Japan Televisions, audiovisual, home appliances, inf... http://dbpedia.org/resource/Consumer_electronics 52876 NaN NaN NaN NaN
257 257 Erste Group Bank AG 2008 Vienna, Austria Retail and commercial banking, investment and ... http://dbpedia.org/resource/Financial_services 47230 2.71983E11 1.96E10 6.772E9 1187000.0
258 258 Manulife Financial Corporation 1887 200 Asset management, Commercial banking, Commerci... http://dbpedia.org/resource/Financial_services 34000 750300000000 47200000000 39000000000 4800000000
259 259 BP plc 1909 London, England, UK http://dbpedia.org/resource/Natural_gas http://dbpedia.org/resource/Petroleum_industry
I also tried converting it again with pd.to_datetime and sorting by dt.year, but still without success.
Column names:
Index(['index', 'name', 'foundation', 'location', 'products', 'sector',
'employee', 'assets', 'equity', 'revenue', 'profit'],
dtype='object')
@Ruslan you simply need to use a "sorting" command, not a "groupby". You can achieve this generally in two ways:
myDF.sort_values(by='column_name', ascending=True, inplace=True)
or, in case you need to set your column as the index first:
myDF = myDF.set_index('column_name')
myDF = myDF.sort_index(ascending=True)
GroupBy is a totally different command: it is used to perform actions after you group values by some criterion, such as finding the sum, average, min or max of the grouped values.
pandas.DataFrame.sort_values
pandas.DataFrame.groupby
I think you're misunderstanding how groupby() works.
You can't do df = df.groupby('foundation'). groupby() does not return a new DataFrame. Instead, it returns a GroupBy object, which is essentially a mapping from each grouped-by value to a dataframe containing the rows that share that value in the specified column.
You can, for example, print how many rows are in each group with the following code:
groups = df.groupby('foundation')
for val, sub_df in groups:
    print(f'{val}: {sub_df.shape[0]} rows')
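To actually get one row per foundation year, follow the groupby with an aggregation instead of reassigning df. A small sketch, assuming the df = df.groupby('foundation') line is removed and using the 'name' column from the Index shown above:
# Number of companies founded per year
counts = df.groupby('foundation')['name'].count()
print(counts)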

SQL condition based on number in field name

I have a table containing numbers of people in each area, by age.
There is a column for each age, as shown in this table (junk data):
Area  0    1    2    3    ...  90+
A     123  65   45   20   --   66
B     442  456  124  422  --   999
C     442  99   88   747  --   234
I need to group these figures into age bands (0-19, 20-39, 40-59, ...), e.g.:
Area  0-19  20-39  40-59  60+
A     789   689    544    1024
B     1564  884    1668   1589
C     800   456    456    951
What is the best way to do this?
I could do a simple SUM as below, but that feels like a massive amount of script for something that feels like it should be straightforward.
SELECT
[0]+[1]+[2]+...[19] AS [0-19],
[20]+[21]+[22]+...[39] AS [20-39]
...
Is there a simpler way? I'm wondering if PIVOT can help but am struggling to visualise how to use it to get my desired result.
Hoping I'm missing something obvious!
EDIT This is how the data has been supplied to me, I know it's not a great table design but unfortunately that's out of my hands.
I would suggest creating a view on top of your table like so:
CREATE VIEW v_t_normal AS
SELECT Area, Age, Value
FROM t
CROSS APPLY (VALUES
    (0, [0]),
    (1, [1]),
    ...
    (90, [90+])
) AS ca(Age, Value)
That view will present your data in a somewhat normalized form. You will not be able to edit the data through the view, but you should be able to perform basic math and aggregation on it. The 90+ value will still cause headaches, as it encapsulates more than one age.
I'm going to answer by suggesting an alternative table design which will make life easier:
Area | Age | Count
A | 0 | 123
A | 1 | 65
...
B | 0 | 442
Here we are storing each area's ages in separate records rather than separate columns. With this design in place, what you are after is easy to come by using conditional aggregation:
SELECT
Area,
SUM(CASE WHEN Age BETWEEN 0 AND 19 THEN Count ELSE 0 END) AS [0-19],
SUM(CASE WHEN Age BETWEEN 20 AND 39 THEN Count ELSE 0 END) AS [20-39],
...
FROM yourNewTable
GROUP BY Area;
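Since the rest of this page is pandas-centric, the same normalize-then-aggregate idea can also be sketched there. This is only an illustration on a few of the sample columns (the 90+ column would still need separate handling, as noted above):
import pandas as pd

# Wide table from the question: one column per single year of age
wide = pd.DataFrame({'Area': ['A', 'B', 'C'],
                     '0': [123, 442, 442],
                     '1': [65, 456, 99],
                     '2': [45, 124, 88],
                     '3': [20, 422, 747]})

# Normalize: one row per (Area, Age), like the CROSS APPLY view above
long = wide.melt(id_vars='Area', var_name='Age', value_name='Count')
long['Age'] = long['Age'].astype(int)

# Conditional aggregation: bucket ages into bands, then sum per band
long['Band'] = pd.cut(long['Age'], bins=[-1, 19, 39, 59, 200],
                      labels=['0-19', '20-39', '40-59', '60+'])
res = long.pivot_table(index='Area', columns='Band', values='Count',
                       aggfunc='sum', fill_value=0, observed=False)
print(res)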

Multilevel Indexing with Groupby

Being new to python, I'm struggling to apply other questions about the groupby function to my data. A sample of the data frame:
ID Condition Race Gender Income
1 1 White Male 1
2 2 Black Female 2
3 3 Black Male 5
4 4 White Female 3
...
I am trying to use the groupby function to get a count of how many Black/White, male/female, and income (12 levels) observations there are in each of the four conditions. All of the columns, including income, are strings (i.e., categorical).
I'd like to get something such as
Condition Race Gender Income Count
1 White Male 1 19
1 White Female 1 17
1 Black Male 1 22
1 Black Female 1 24
1 White Male 2 12
1 White Female 2 15
1 Black Male 2 17
1 Black Female 2 19
...
Everything I've tried has come back very wrong, so I don't think I'm anywhere near right, but I've been using variations of
Data.groupby(['Condition','Gender','Race','Income'])['ID'].count()
When I run the above line I just get a 2 column matrix with an indecipherable index (e.g., f2df9ecc...) and the second column is labeled ID with what appear to be count numbers. Any help is appreciated.
If you investigate the resulting dataframe, you will see that the grouping columns are inside the index, so just reset the index...
df = Data.groupby(['Condition','Gender','Race','Income'])['ID'].count().reset_index()
That was mainly to demonstrate; since you know what you want, you can specify the as_index argument as follows:
df = Data.groupby(['Condition','Gender','Race','Income'],as_index=False)['ID'].count()
Also, since you want the last column to be 'count':
df = df.rename(columns={'ID':'count'})
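Putting the pieces together, the whole thing can be written as one chain (a sketch, assuming Data is the frame from the question):
df = (Data.groupby(['Condition', 'Gender', 'Race', 'Income'], as_index=False)['ID']
          .count()
          .rename(columns={'ID': 'count'}))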