Multiplying groupby elements for each group in survey - pandas

I am working with the Stack Overflow 2019 Survey data. Here is the Survey 2019 data.
There are lots of columns in that data.
I want to carry out this calculation: "Sum of Age1stCode" / "Number of people of that age".
Age1stCode is a survey column giving the age at which the respondent first wrote code. Age is the respondent's age in years.
I have grouped the data by "Age".
I want to multiply each Age1stCode value by the number of times it occurs and then sum the products. For instance, for age 11: (6x3)+(7x3)+(9x2)+...+(8x1). I want to do this for each age group, so that I end up with output like the file I attached: "Age 11.0 ----> 326 (a made-up number, just as an example), Age 12.0 ---> 468".
My goal is to calculate the sum of Age1stCode for each age group.
Here is the output that I want to work with (attached file).

df_grouped = df.groupby('Age').agg({'Age1stCode': 'sum'}).reset_index()
new_col = df_grouped['Age1stCode'] / df_grouped['Age']
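Note that the second line above divides by the age value itself. If the goal is the sum divided by the number of respondents of that age, as the question states, a minimal sketch could look like the following. The toy rows are invented for illustration, the helper column name Age1stCode_num is hypothetical, and the sketch assumes Age1stCode may contain non-numeric answers that should be ignored.

```python
import pandas as pd

# Toy data standing in for the survey (values invented for illustration).
df = pd.DataFrame({
    "Age": [11.0, 11.0, 11.0, 12.0, 12.0],
    "Age1stCode": ["6", "7", "Younger than 5 years", "9", "8"],
})

# Assumption: text answers cannot be summed, so coerce them to NaN.
df["Age1stCode_num"] = pd.to_numeric(df["Age1stCode"], errors="coerce")

# Per age group: sum of Age1stCode and number of usable answers.
summary = (
    df.groupby("Age")["Age1stCode_num"]
      .agg(total="sum", respondents="count")
      .reset_index()
)
summary["ratio"] = summary["total"] / summary["respondents"]
print(summary)
```

Summing the column per group already gives the "value times count" total described in the question, so no explicit multiplication is needed.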


Combining multiple dataframe columns into a single time series

I have built a financial model in python where I can enter sales and profit for x years in y scenarios - a base scenario plus however many I add.
Annual figures are uploaded per scenario in my first dataframe (e.g. if x = 5 beginning in 2022 then the base scenario sales column would show figures for 2022, 2023, 2024, 2025 and 2026)
I then use monthly weightings to create a monthly phased sales forecast in a new dataframe with the title Base sales 2022 and figures shown monthly, base sales 2023, base sales 2024 etc
I want to show these figures in a single series, so that I have a single times series for base sales of Jan 2022 to Dec 2026 for charting and analysis purposes.
I've managed to get this to work by creating a list and manually adding the names of each column I want to add but this will not work if I have a different number of scenarios or years so am trying to automate the process but can't find a way where I can do this.
I don't want to share my main model code, but I have created a mini model below that does a similar thing. It doesn't work: although it generates most of the output I want (three lists are requested: listA0, listA1, listA2), the lists clearly aren't created, because they can't be referenced afterwards. Also, I really need all the text on a single line rather than split over multiple lines (or perhaps I should use list append for each subsequent item). Any help gratefully received.
Below is the code I have tried:
import pandas as pd

#Create list of scenarios and capture the number for use later
Scenlist=["Bad","Very bad","Terrible"]
Scen_number=3
#Create the list of years under assessment and count the number of years
Years=[2020,2021,2022]
Totyrs=len(Years)
#Create the dataframe dprofit and for example purposes create the columns, all showing two datapoints 10 and 10
dprofit=pd.DataFrame()
a=0
b=0
#This creates column names in the format Bad profit 2020, Bad profit 2021 etc
while a<Scen_number:
    while b<Totyrs:
        dprofit[Scenlist[a]+" profit "+str(Years[b])]=[10,10]
        b=b+1
    b=0
    a=a+1
#Now that the columns have been created print the table
print(dprofit)
#Now create the new table dprofit2 which will be used to capture the three columns (bad, very bad and terrible) for the full time period by listing the years one after another
dprofit2=pd.DataFrame()
#Create the output to recall the columns from dprofit to combine into 3 lists listA0, listA1 and listA2
a=0
b=0
Totyrs=len(Years)
while a<Scen_number:
    while b<Totyrs:
        if b==0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]")
        b=b+1
    b=0
    a=a+1
print(listA0)
#print(listA0) fails with NameError: name 'listA0' is not defined. Did you mean: 'list'?
To fix the printing you could set the end param to end=''.
results = []  # collects the [scenario, year] pairs that were printed
a = 0
b = 0
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        results.append([Scenlist[a], Years[b]])
        b = b + 1
    print()
    b = 0
    a = a + 1
Output:
listA0=dprofit[Bad profit 2020]+dprofit[Bad profit 2021]+dprofit[Bad profit 2022]
listA1=dprofit[Very bad profit 2020]+dprofit[Very bad profit 2021]+dprofit[Very bad profit 2022]
listA2=dprofit[Terrible profit 2020]+dprofit[Terrible profit 2021]+dprofit[Terrible profit 2022]
To obtain a list or pd.DataFrame of the columns, you could simply filter() for the required columns. No loop required.
listA0 = dprofit.filter(regex="Bad profit", axis=1)
listA1 = dprofit.filter(regex="Very bad profit", axis=1)
listA2 = dprofit.filter(regex="Terrible profit", axis=1)
print(listA1)
Output for listA1:
   Very bad profit 2020  Very bad profit 2021  Very bad profit 2022
0                    10                    10                    10
1                    10                    10                    10
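Printing an assignment only produces text, so it never creates a variable. If the aim is one continuous series per scenario with the years stacked one after another, a minimal sketch could build the column names from the scenario and year lists and concatenate the matching columns. The lower-case names scenarios, years and combined are illustrative, and the sketch assumes the column naming pattern from the question.

```python
import pandas as pd

scenarios = ["Bad", "Very bad", "Terrible"]
years = [2020, 2021, 2022]

# Rebuild the example frame from the question: two datapoints per column.
dprofit = pd.DataFrame(
    {f"{s} profit {y}": [10, 10] for s in scenarios for y in years}
)

# For each scenario, stack its yearly columns end to end into one series.
combined = pd.DataFrame({
    s: pd.concat(
        [dprofit[f"{s} profit {y}"] for y in years],
        ignore_index=True,
    )
    for s in scenarios
})
print(combined)
```

Because the column names are generated from the two lists, the same code keeps working when scenarios or years are added.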

Loops in Dataframe

I have 4 columns: Country, Year, GDP Annual Growth and Field Size in MM Barrels.
I am looking for a way to write a loop that computes the mean GDP growth over the 5 years following the discovery of a field (a non-null "Field_Size_MM_Barrels" value). Example: in 1961 a discovery of size 2462 was made in Algeria. What is the average annual GDP growth over the following 5 years (1962-1966)?
NaN refers to years where no discoveries were made in this case. I would like the loop to add the mean value each time in a column next to Field Size. Any idea how to do that?
Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
Yemen,2016,-13.621458,NaN
Yemen,2017,-5.942320,NaN
Yemen,2018,-2.701475,NaN
Divided Neutral Zone: Kuwait/Saudi Arabia,1963,NaN,832.0
Divided Neutral Zone: Kuwait/Saudi Arabia,1967,NaN,1566.0
# read in with
df = pd.read_clipboard(sep=',')
If you could include a sample of the dataframe (say first 20 rows) then it will help answer/test answers. Here's a possible starting point:
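# create a list for the average GDP growth values
averages = []
# go over all rows in df by position
# (this assumes rows are consecutive years and ignores country boundaries)
for row_id in range(len(df)):
    field_size = df.iloc[row_id]["Field_Size_MM_Barrels"]
    if pd.isna(field_size):
        # no discovery in this row, so there is nothing to average
        averages.append(float("nan"))
        continue
    # positions of the next five rows, clipped at the end of the frame
    row_list = list(range(row_id + 1, min(row_id + 6, len(df))))
    averages.append(df["GDP Annual Growth"].iloc[row_list].mean())
# the new column name is only a suggestion
df["Mean_Growth_Next_5Y"] = averages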
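A positional loop like the one above can spill from one country into the next and treats gaps between years as if they were consecutive. As a hedged alternative, the sketch below looks up the following five calendar years within each country. It assumes each country has at most one row per year; the column name Mean_Growth_Next_5Y and the helper mean_after are illustrative, and missing years are simply skipped by the mean.

```python
import io
import pandas as pd

# Sample rows copied from the question.
csv_text = """Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
Yemen,2016,-13.621458,NaN
Yemen,2017,-5.942320,NaN
Yemen,2018,-2.701475,NaN
"""
df = pd.read_csv(io.StringIO(csv_text))

WINDOW = 5  # number of years after the discovery year
parts = []
for _, group in df.groupby("Country", sort=False):
    growth_by_year = group.set_index("Year")["GDP Annual Growth"]

    def mean_after(year, growth=growth_by_year):
        # Years with no row become NaN and are ignored by mean().
        following_years = range(int(year) + 1, int(year) + 1 + WINDOW)
        return growth.reindex(following_years).mean()

    means = group["Year"].map(mean_after)
    # Keep a value only in rows where a discovery was recorded.
    parts.append(means.where(group["Field_Size_MM_Barrels"].notna()))

df["Mean_Growth_Next_5Y"] = pd.concat(parts)
print(df)
```

With only a few rows per country in the sample, the mean for 1961 is taken over the four following years that are actually present.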

How to select columns based on value they contain pandas

I am working in pandas with a dataset that describes the population of a country per year. The dataset is construed in a weird way wherein the years aren't the columns themselves but rather the years are a value within the first row of the set. The dataset describes every year from 1960 until now, but I only need 1970, 1980, 1990, etc. For this purpose I've created a list with those years and tried to make a new dataset that is equivalent to the old one but only has the columns that contain a value from that list, so I don't carry around extra data I'm not using. Online I can only find instructions for removing rows or selecting by column name. Since neither applies in this situation, I thought I should ask here.
The dataset is a CSV file which I've downloaded from a world population site. Here is a link to a screenshot of the data.
As you can see the years are given in scientific notation for some years, which is also how I've added them to my list.
pop = pd.read_csv('./maps/API_SP.POP.TOTL_DS2_en_csv_v2_10576638.csv',
                  header=None, engine='python', skiprows=4)
display(pop)
years = ['1.970000e+03','1.980000e+03','1.990000e+03','2.000000e+03','2.010000e+03','2.015000e+03', 'Country Name']
pop[pop.columns[pop.isin(years).any()]]
This is one of the things I've tried so far which I thought made the most sense, but I am still very new to pandas so any help would be greatly appreciated.
Using the data at https://data.worldbank.org/indicator/sp.pop.totl, copied into pastebin (first time using the service, so apologies if it doesn't work for some reason):
import pandas as pd

# actual code using CSV file saved to desktop
#df = pd.read_csv(<path to CSV>, skiprows=4)
# pastebin for reproducibility
df = pd.read_csv(r'https://pastebin.com/raw/LmdGySCf',sep='\t')
# manually select years and other columns of interest
colsX = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
         '1990', '1995', '2000']
dfX = df[colsX]
# select every fifth year
year_cols = df.filter(regex='19|20', axis=1).columns
colsY = [col for col in year_cols if int(col) % 5 == 0]
dfY = df[colsY]
As a general comment:
The dataset is construed in a weird way wherein the years aren't the columns themselves but rather the years are a value within the first row of the set.
This is not correct. Viewing the CSV file, it is quite clear that row 5 (Country Name, Country Code, Indicator Name, Indicator Code, 1960, 1961, ...) contains the column names. You have read the data into pandas in such a way that those values are not treated as column names, so your first step, before trying to subset your data, should be to ensure you have read in the data properly, which in this case would give you column headers named for each year.
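To illustrate the point about reading the header correctly, here is a minimal sketch using a small in-memory stand-in for the file. The four metadata lines, the country "Exampleland" and its numbers are placeholders, not real World Bank data. Dropping header=None lets pandas use the first unskipped row as column names, after which the wanted years can be selected by name.

```python
import io
import pandas as pd

# Placeholder text mimicking the layout: 4 preamble lines, then the header row.
raw = """metadata line 1
metadata line 2
metadata line 3
metadata line 4
Country Name,Country Code,1960,1965,1970,1975,1980
Exampleland,EXL,100,110,120,130,140
"""

# Without header=None, the first row after the skipped lines becomes the header.
df = pd.read_csv(io.StringIO(raw), skiprows=4)

# Column labels read from a header are strings, so the years are listed as strings.
wanted_years = ["1970", "1980", "1990"]
present = [year for year in wanted_years if year in df.columns]
subset = df[["Country Name"] + present]
print(subset)
```

Filtering the wanted years against df.columns avoids a KeyError when a requested year is not in the file.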

pandas dataframe subsetting on multiple loop conditions

I have a pandas df containing users, their answers to a survey, and a score, e.g.
Userid  incomebracket  insurance-knowledge  .....  score
123     3              3                           56
346     4              6                           65
Assume income bracket has 6 levels (1: 1000-5000 up to 6: 100000+); similarly, insurance-knowledge has 6 levels (1: very little to 6: expert).
Now I have another df which has user profile features like
userid,age,gender,education....(10 such features)
Now I iterate through the set of users (first df) and, for each of them, want to get the entire subset of other users who have the same user profile but a higher answer on each column of the first df, say income. I am doing this as follows for, say, 3 profile features (age, gender and education):
df_sameusergroup=df[(df['PPGENDER']==sameuser_gender.values[0])
                    & (df['EDUC']==sameuser_educ.values[0])
                    & (df['age']==sameuser_agecat.values[0])
                    & (df['incomebracket']>user_feature.values[0])]
Although this works, the profile features are hardcoded, which is a problem for longer conditions. What I want is to
get the subset of users who have the same profile on all 10 features but a higher answer. If there is no such record (which is possible), reduce to 9 features, then to 8, 7, ..., down to 2 (the most important features, say age and gender). My pseudocode for this looks like this:
for i in range(10, 1, -1):  # match on all 10 profile features, then 9, then 8, ... down to 2
    Similaruserdf = df[subset where the first i features are the same and income is higher]
    if len(Similaruserdf) == 0:  # no such users with all i features the same
        continue  # reduce the number of features to match on
    else:
        return Similaruserdf
I am stuck trying to do this and have been looking throughout to find a solution. Any help would be greatly appreciated. Thanks.
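One possible way to avoid hardcoding the conditions is to keep the profile features in a list ordered from most to least important and build the boolean mask in a loop, dropping the least important feature each time nothing matches. The sketch below is an assumption-laden illustration, not a tested solution for the real data: the function name, the toy rows and the use of a single merged DataFrame holding both profile features and answers are all hypothetical.

```python
import pandas as pd

def similar_users_with_higher_answer(df, user_row, features, answer_col,
                                     min_features=2):
    """Return (subset, features_used); features must be ordered by importance."""
    for n in range(len(features), min_features - 1, -1):
        # Start with "higher answer", then require equality on the first n features.
        mask = df[answer_col] > user_row[answer_col]
        for feature in features[:n]:
            mask &= df[feature] == user_row[feature]
        subset = df[mask]
        if not subset.empty:
            return subset, features[:n]
    # Nothing matched even on the minimum number of features.
    return df.iloc[0:0], []

# Toy data (invented): profile features and one answer column in one frame.
df = pd.DataFrame({
    "userid": [1, 2, 3, 4],
    "age": [3, 3, 3, 2],
    "PPGENDER": [1, 1, 2, 1],
    "EDUC": [2, 3, 2, 2],
    "incomebracket": [3, 5, 4, 6],
})
features = ["age", "PPGENDER", "EDUC"]  # most important first

matches, used = similar_users_with_higher_answer(
    df, df.iloc[0], features, "incomebracket"
)
print(used)
print(matches)
```

In the toy data no other user matches user 1 on all three features, so the search falls back to age and gender and finds user 2.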

SSRS 2008 display multiple columns of data without a new line

I am creating a report in SSRS 2008 with MS SQL Server 2008 R2. My data is aggregated by medical condition (Outcome) and level of severity (Response).
Outcome  Response  Adult  Youth  Total
BMI      GOOD         70      0     70
BMI      MONITOR     230      0    230
BMI      PROBLEM!     10      0     10
LDL      GOOD          5      0      5
LDL      MONITOR       4      0      4
LDL      PROBLEM!      2      0      2
I need to display the data based on the Response like:
       BMI   BMI      BMI
       GOOD  MONITOR  PROBLEM!
Total  70    230      10
Youth  0     0        0
Adult  70    230      10
       LDL   LDL      LDL
       GOOD  MONITOR  PROBLEM!
Total  5     4        2
Youth  0     0        0
Adult  5     4        2
I first tried to use SSRS to do the grouping based on the Outcome and then the Response, but I got each Response on a separate row of data, whereas I need each Outcome on a single line. I now believe that a pivot would work, but all the examples I have seen pivot one column of data using another. Is it possible to pivot multiple columns of data based on a single column?
With your existing Dataset you could do something similar to the following.
Create a List item, and change the Details grouping to be based on Outcome.
In the List cell, add a new Matrix with one Column Group based on Response.
You'll note that since you have individual columns for Total, Youth, Adult, you need to add grand total rows to display each group.
The end result is pretty close to your requirements.
For your underlying data, to help with report development it might be useful to have the Total, Youth, Adult as unpivoted columns, but it's not a big deal if the groups are fairly static.