I have 4 columns: Country, Year, GDP Annual Growth and Field Size in MM Barrels.
I am looking for a way to write a loop that generates the mean GDP growth over the 5 years following the discovery of a field ("Field Size MM Barrels"). Example: in 1961 a discovery of size 2462 was made in Algeria. What is the average GDP annual growth over the following 5 years (1962-1966)?
Here, NaN marks years in which no discoveries were made. I would like the loop to add each mean value in a column next to Field Size. Any idea how to do that?
Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
Yemen,2016,-13.621458,NaN
Yemen,2017,-5.942320,NaN
Yemen,2018,-2.701475,NaN
Divided Neutral Zone: Kuwait/Saudi Arabia,1963,NaN,832.0
Divided Neutral Zone: Kuwait/Saudi Arabia,1967,NaN,1566.0
# read in with
import pandas as pd
df = pd.read_clipboard(sep=',')
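If you don't have the sample on the clipboard, the same data can be read from a string instead, for example:

import io
import pandas as pd

csv = """Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
"""
df = pd.read_csv(io.StringIO(csv))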
If you could include a sample of the dataframe (say, the first 20 rows), it would help in answering and testing answers. Here's a possible starting point:
import pandas as pd

# create a list for the average GDP values
averages = []
# go over all rows in df
for row_id in range(len(df)):
    test = df.iloc[row_id]["Field_Size_MM_Barrels"]
    if pd.notna(test):  # a discovery was made this year
        # average the GDP growth of the next 5 rows
        # (iloc slicing is safe near the end of the frame)
        mean_growth = df["GDP Annual Growth"].iloc[row_id + 1:row_id + 6].mean()
        averages.append(mean_growth)
    else:
        averages.append(float("nan"))
# add the means as a new column next to Field Size
df["Mean_GDP_Next_5y"] = averages
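If the frame stacks several countries (as in the sample), a vectorized per-country variant avoids bleeding one country's growth into another's window. A sketch, under two assumptions: there is one row per country and year (with year gaps, "next 5 rows" is not "next 5 calendar years"), and the column name Mean_GDP_Next_5y is just an illustrative choice:

import pandas as pd

# mean of the *next* 5 rows within each country:
# reverse each group, take a 5-wide rolling mean, then shift by one and reverse back
def next5_mean(s):
    return s[::-1].rolling(5, min_periods=1).mean().shift(1)[::-1]

# sort first; df.sort_index() would restore the original order afterwards
df = df.sort_values(["Country", "Year"])
next5 = df.groupby("Country")["GDP Annual Growth"].transform(next5_mean)
# keep the mean only for discovery years
df["Mean_GDP_Next_5y"] = next5.where(df["Field_Size_MM_Barrels"].notna())

With min_periods=1 a shorter window is used near the end of a series; use min_periods=5 instead to require a full five following years.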
I am working on the Stack Overflow 2019 Survey data. Here is the Survey 2019 data.
There are lots of columns in that data.
I want to carry out this calculation ---> "Sum of Age1stCode" / "Number of people who are that age".
Age1stCode is a column in the survey giving the age at which a respondent first wrote code. Age is the respondent's current age in years.
I have created a group according to "Age".
I just want to multiply each Age1stCode value by its count and then sum the products. For instance, for age 11: (6x3) + (7x3) + (9x2) + ... + (8x1). I want to do this for each age group, so at the end I achieve an output like the file I attached: "Age 11.0 ---> 326 (just a random example), Age 12.0 ---> 468".
My goal is to calculate the sum of Age1stCode for each age group.
Here is the output that I want to work with (attached file).
df_grouped = df.groupby('Age').agg(code_sum=('Age1stCode', 'sum'),
                                   n_people=('Age1stCode', 'count')).reset_index()
new_col = df_grouped['code_sum'] / df_grouped['n_people']
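Note that the weighted sum in the example, (6x3) + (7x3) + ..., is just the plain group sum computed from value counts, so no explicit multiplication step is needed. A minimal sketch with toy data standing in for the survey columns (in the real survey Age1stCode is stored as text, e.g. "Younger than 5 years", so a numeric coercion is assumed first):

import pandas as pd

# toy stand-in for the survey columns
df = pd.DataFrame({'Age':        [11, 11, 11, 12, 12],
                   'Age1stCode': ['6', '6', '7', '9', '8']})

# coerce text to numbers; non-numeric entries become NaN and drop out
df['Age1stCode'] = pd.to_numeric(df['Age1stCode'], errors='coerce')

code_sum = df.groupby('Age')['Age1stCode'].sum()    # the (value x count) sum
n_people = df.groupby('Age')['Age1stCode'].count()  # people per age group
print(code_sum / n_people)                          # equivalently: .mean()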
I have a dataset with daily sales of two products for the first 10 days of their release. The dataframe below shows, for each product, the number of single items and the number of dozens sold per day. It's believed that no dozen of a product was sold before a single item of that product had been sold. Each period (Period_ID) has an expected number of dozen sales for each product.
import pandas as pd

d = {'Period_ID': ['A12']*10, 'Prod_A_Doz': [1.2]*10, 'Prod_B_Doz': [2.4]*10,
     'A_Singles': [0, 0, 0, 1, 1, 2, 2, 3, 3, 4], 'B_Singles': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
     'A_Dozens': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'B_Dozens': [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]}
df = pd.DataFrame(data=d)
QUESTION
I want to construct a descriptive analysis, and one of my questions is: how many single items of each product were sold, on average, before a dozen was sold for the 1st time, the 2nd time, ..., the 10th time?
Given that df.Period_ID.nunique() = 1568
Modifying the dataset to hold sales per day (as opposed to the cumulative sales above) and using Pankaj Joshi's solution with a small alteration,
print(f'Average number of single items before {index + 1} dozen = {df1.A_Singles[:val+1].mean():0.2f}')
d = {'Period_ID': ['A12']*10, 'Prod_A_Doz': [1.2]*10, 'Prod_B_Doz': [2.4]*10,
     'A_Singles': [0, 0, 0, 1, 0, 1, 0, 1, 0, 1], 'B_Singles': [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
     'A_Dozens': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 'B_Dozens': [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]}
df1 = pd.DataFrame(data=d)
# For product A
Average number of single items before 1 dozen = 0.38
# For product B
6
Average number of single items before 1 dozen = 0.43
8
Average number of single items before 2 dozen = 0.44
But I want the second average to be counted from the last dozen sale rather than from the start of the period, so instead of 0.44 it should be 0.5.
The aim is that once I have this information for each Period_ID, I will take the average over all df.Period_ID.nunique() (= 1568) periods and try to optimise the expected number of 'Dozens' sales for each product, given under the columns Prod_A_Doz and Prod_B_Doz.
I would appreciate any help.
Here is how I will go about it:
import pandas as pd

d = {'Period_ID': ['A12']*10, 'Prod_A_Doz': [1.2]*10, 'Prod_B_Doz': [2.4]*10,
     'A_Singles': [0, 0, 0, 1, 1, 2, 2, 3, 3, 4], 'B_Singles': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
     'A_Dozens': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'B_Dozens': [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]}
df1 = pd.DataFrame(data=d)

for per_id in set(df1.Period_ID):
    print(per_id)
    # reset the index so the label slicing below matches day positions
    df_temp = df1[df1.Period_ID == per_id].reset_index(drop=True)
    # rows on which a dozen of product A was sold, in order
    for index, val in enumerate(df_temp.index[df_temp.A_Dozens > 0]):
        print(val)
        print(f'Average number of single items before {index + 1} dozen = {df_temp.A_Singles[:val].mean():0.2f}')
    # same for product B (note B_Singles here, not B_Dozens)
    for index, val in enumerate(df_temp.index[df_temp.B_Dozens > 0]):
        print(val)
        print(f'Average number of single items before {index + 1} dozen = {df_temp.B_Singles[:val].mean():0.2f}')
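To get the 0.5 the asker wants for the second dozen, the singles can be counted only from the day after the previous dozen sale. A sketch (not part of the original answer), assuming the per-day, non-cumulative frame from the question's edit:

import pandas as pd

d = {'Period_ID': ['A12']*10,
     'B_Singles': [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
     'B_Dozens': [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]}
df1 = pd.DataFrame(data=d)

dozen_rows = df1.index[df1.B_Dozens > 0]
prev = -1  # no previous dozen at the start of the period
for index, val in enumerate(dozen_rows):
    # singles sold since the previous dozen, up to and including this day
    window = df1.B_Singles[prev + 1:val + 1]
    print(f'Average number of single items before {index + 1} dozen = {window.mean():0.2f}')
    prev = val
# prints 0.43 for the first dozen and 0.50 for the second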
I'm currently working on a report where I'm given 3 different datasets. The report essentially calculates the input, output and losses of a given food production process.
In dataset "Spices", contains the quantity of spices used under a field named "Qty_Spice". In dataset "Meat", contains the quantity of meat used under a field named "Qty_Meat". In dataset "Finished", contains the quantity of finished product used under a field "Qty_Finished".
I'm currently trying to create a table where the amount of input (spice+meat) is compared against output (finished product), such that the table looks like this:
Sum of Inputs (kg) | Finished Product (kg) | Losses (kg)
                10 |                     8 |           2
                 8 |                     5 |           3
Total:
                18 |                    13 |           5
What I'm currently doing is using LookupSet to get all inputs of both spices and meats (LookupSet rather than Lookup because many different types of meats and spices are used), then using custom code named "SumLookup" to sum the quantities LookupSet returned.
The problem I'm having is that when I want to get the total sum of all inputs and all finished products (the bottom row of the table) using "SumLookup", the table only returns the first weight it finds. In the example above, it would return 10, 8 and 2 as the totals for inputs, finished product and losses respectively.
Does anyone know how I should approach solving this?
I'd really appreciate any help.
Here is the custom code I used for SumLookUp:
Public Function SumLookup(ByVal items As Object()) As Decimal
    ' LookupSet returns Nothing when no rows match
    If items Is Nothing Then
        Return 0
    End If
    Dim suma As Decimal = 0
    For Each item As Object In items
        ' skip empty entries and convert the rest to Decimal
        If item IsNot Nothing Then
            suma += Convert.ToDecimal(item)
        End If
    Next
    Return suma
End Function
The dataframe below shows the monthly revenue of two shops (shop_id=11, shop_id=15) over a few years:
import pandas as pd

data = {'shop_id': [11, 15, 15, 15, 11, 11],
        'month': [1, 1, 2, 3, 2, 3],
        'year': [2011, 2015, 2015, 2015, 2014, 2014],
        'revenue': [11000, 5000, 4500, 5500, 10000, 8000]}
df = pd.DataFrame(data)
df = df[['shop_id', 'month', 'year', 'revenue']]
display(df)
Notice that shop_id=11 has only one entry in 2011 (January), while shop_id=15 has a few entries in 2015 (January, February, March). Nevertheless, it's interesting to note that the first shop has a few more entries in 2014:
I'm trying to optimize a custom function (used along with .apply()) that creates a new feature called diff_revenue: for each shop, this feature shows the change in revenue from the previous month.
Here is how some of the values in diff_revenue are generated:
The value in the first cell is 0 (red) because there is no previous information for shop_id=11;
The 2nd cell is also 0 (orange), for the same reason: there is no previous information for shop_id=15;
The 3rd cell is 500 (green), because the change from this shop's last entry (January 2015) to the current cell's revenue (February 2015) is 500 Trumps;
The 5th cell is 1000 (dark blue), because the change from this shop's last entry (January 2011) to the current cell's revenue (February 2014) is 1000 Trumps.
I'm no expert in Pandas and was wondering if the Pandas gods know a better way. The DataFrame I have to work with is quite large (1M+ observations) and my current approach is too slow. I'm looking for a faster alternative, or maybe something more readable.
You more or less want to use Series.diff on the 'revenue' column, but need to do a few additional things:
Sort to ensure your DataFrame is in chronological order (can undo this later)
Perform a groupby on 'shop_id' to do group level operations
Take the absolute value, since you don't want to distinguish between positive and negative
In terms of code:
# sort the values so they're in order when we perform a groupby
df = df.sort_values(by=['year', 'month'])
# perform a groupby on 'shop_id' and get the row-wise difference within each group
df['diff_revenue'] = df.groupby('shop_id')['revenue'].diff()
# fill NA as zero (no previous info), take absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].fillna(0).abs().astype('int')
# revert to original order
df = df.sort_index()
The resulting output:
shop_id month year revenue diff_revenue
0 11 1 2011 11000 0
1 15 1 2015 5000 0
2 15 2 2015 4500 500
3 15 3 2015 5500 1000
4 11 2 2014 10000 1000
5 11 3 2014 8000 2000
Edit
A slightly less straightforward solution, but maybe a bit more performant:
# sort the values so they're in chronological order within each shop_id
df = df.sort_values(by=['shop_id', 'year', 'month'])
# take the row-wise difference ignoring changes in shop_id
df['diff_revenue'] = df['revenue'].diff()
# zero out locations where shop_id changes (no previous info)
df.loc[df['shop_id'] != df['shop_id'].shift(), 'diff_revenue'] = 0
# Take the absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].abs().astype('int')
# revert to original order
df = df.sort_index()
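As a quick sanity check that both variants produce the same column, you can compare them on the sample frame above (a sketch; on real data, rows that tie on shop_id, year and month would need a stable sort to line up identically):

# variant 1: groupby + diff
a = (df.sort_values(['year', 'month'])
       .groupby('shop_id')['revenue'].diff()
       .fillna(0).abs().astype('int').sort_index())

# variant 2: single sort + shift comparison
tmp = df.sort_values(['shop_id', 'year', 'month'])
b = tmp['revenue'].diff()
b.loc[tmp['shop_id'] != tmp['shop_id'].shift()] = 0
b = b.abs().astype('int').sort_index()

print(a.equals(b))  # True on this sample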