Combining multiple dataframe columns into a single time series

I have built a financial model in Python where I can enter sales and profit for x years in y scenarios - a base scenario plus however many I add.
Annual figures are uploaded per scenario in my first dataframe (e.g. if x = 5 beginning in 2022, then the base scenario sales column would show figures for 2022, 2023, 2024, 2025 and 2026).
I then use monthly weightings to create a monthly phased sales forecast in a new dataframe, with one column per scenario-year (Base sales 2022, Base sales 2023, Base sales 2024, etc.) and figures shown monthly.
I want to show these figures in a single series, so that I have a single time series for base sales from Jan 2022 to Dec 2026 for charting and analysis purposes.
I've managed to get this to work by creating a list and manually adding the names of each column I want to add, but this will not work if I have a different number of scenarios or years, so I am trying to automate the process but can't find a way to do it.
I don't want to share my main model code, but I have created a mini model below that does a similar thing. It doesn't quite work: although it generates most of the output I want (three lists are requested: listA0, listA1, listA2), the lists clearly aren't created, as they aren't callable. Also, I really need all the text on a single line rather than split over multiple lines (or perhaps I should use list append for each subsequent item). Any help gratefully received.
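(For context, the monthly phasing step described above amounts to something like this sketch; the weights and the annual figures here are invented purely for illustration.)
import pandas as pd

# invented monthly weightings that sum to 1
weights = pd.Series([0.06, 0.07, 0.08, 0.09, 0.09, 0.09,
                     0.09, 0.09, 0.09, 0.09, 0.08, 0.08])

annual_base_sales = {2022: 1200, 2023: 1400}  # illustrative annual figures

# one monthly column per year, e.g. "Base sales 2022"
monthly = pd.DataFrame({f"Base sales {yr}": total * weights
                        for yr, total in annual_base_sales.items()})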
Below is the code I have tried:
import pandas as pd

#Create list of scenarios and capture the number for use later
Scenlist=["Bad","Very bad","Terrible"]
Scen_number=3
#Create the list of years under assessment and count the number of years
Years=[2020,2021,2022]
Totyrs=len(Years)
#Create the dataframe dprofit and for example purposes create the columns, all showing two datapoints 10 and 10
dprofit=pd.DataFrame()
a=0
b=0
#This creates column names in the format Bad profit 2020, Bad profit 2021 etc
while a<Scen_number:
    while b<Totyrs:
        dprofit[Scenlist[a]+" profit "+str(Years[b])]=[10,10]
        b=b+1
    b=0
    a=a+1
#Now that the columns have been created print the table
print(dprofit)
#Now create the new table dprofit2 which will be used to capture the three columns (Bad, Very bad and Terrible) for the full time period by listing the years one after another
dprofit2=pd.DataFrame()
#Create the output to recall the columns from dprofit to combine into 3 lists: listA0, listA1 and listA2
a=0
b=0
Totyrs=len(Years)
while a<Scen_number:
    while b<Totyrs:
        if b==0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]")
        b=b+1
    b=0
    a=a+1
print(listA0)
#print(listA0) will not run, failing with NameError: name 'listA0' is not defined. Did you mean: 'list'?

To fix the printing you could set the end param to end=''.
a = 0
b = 0
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        b = b + 1
    print()  # newline once each scenario's statement is complete
    b = 0
    a = a + 1
Output:
listA0=dprofit[Bad profit 2020]+dprofit[Bad profit 2021]+dprofit[Bad profit 2022]
listA1=dprofit[Very bad profit 2020]+dprofit[Very bad profit 2021]+dprofit[Very bad profit 2022]
listA2=dprofit[Terrible profit 2020]+dprofit[Terrible profit 2021]+dprofit[Terrible profit 2022]
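Note that those print() calls only write text to the screen; nothing named listA0 is ever defined, which is why the NameError appears. A sketch that actually creates the objects, using a dict keyed by scenario number instead of generated variable names:
# build each scenario's combined series directly
listA = {}
for a, scen in enumerate(Scenlist):
    cols = [f"{scen} profit {year}" for year in Years]
    listA[a] = dprofit[cols].sum(axis=1)  # same as col1 + col2 + col3
print(listA[0])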
To obtain a list or pd.DataFrame of the columns, you could simply filter() for the required columns. No loop required.
listA0 = dprofit.filter(regex="Bad profit", axis=1)
listA1 = dprofit.filter(regex="Very bad profit", axis=1)
listA2 = dprofit.filter(regex="Terrible profit", axis=1)
print(listA1)
Output for listA1:
   Very bad profit 2020  Very bad profit 2021  Very bad profit 2022
0                    10                    10                    10
1                    10                    10                    10
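If the end goal is one continuous monthly series per scenario (January of the first year through December of the last) rather than an element-wise sum, stacking the year columns end-to-end is one sketch:
# columns for one scenario, in year order (names as built above)
base_cols = [f"Bad profit {yr}" for yr in Years]

# stack them into a single series; with monthly rows this gives one
# observation per month from the first year through the last
single_series = pd.concat([dprofit[c] for c in base_cols], ignore_index=True)
print(single_series)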

Related

Loops in Dataframe

I have 4 columns: Country, Year, GDP Annual Growth and Field Size in MM Barrels.
I am looking for a way to create a loop that generates the mean GDP growth value over the 5 years following the discovery of a field ("Field_Size_MM_Barrels"). Example: in 1961 a discovery was made in Algeria with a size of 2462. What is the average GDP annual growth value over the following 5 years (1962-1966)?
NaN refers to years where no discoveries were made in this case. I would like the loop to add the mean value each time in a new column next to Field Size. Any idea how to do that?
Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
Yemen,2016,-13.621458,NaN
Yemen,2017,-5.942320,NaN
Yemen,2018,-2.701475,NaN
Divided Neutral Zone: Kuwait/Saudi Arabia,1963,NaN,832.0
Divided Neutral Zone: Kuwait/Saudi Arabia,1967,NaN,1566.0
# read in with
df = pd.read_clipboard(sep=',')
If you could include a sample of the dataframe (say the first 20 rows) it would help in answering and testing answers. Here's a possible starting point:
# create a list for average GDP values
average = []
# go over all rows of df
for row_id in range(len(df)):
    test = df.iloc[row_id]["Field_Size_MM_Barrels"]
    # a discovery was made this year if the field size is not NaN
    if pd.notna(test):
        # mean GDP growth over the 5 rows that follow the discovery
        average.append(df["GDP Annual Growth"].iloc[row_id + 1:row_id + 6].mean())
    else:
        average.append(float("nan"))
# attach next to Field_Size_MM_Barrels (illustrative column name)
df["Mean_GDP_Next_5"] = average
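A vectorized sketch of the same idea, assuming one row per country-year (so "5 rows ahead" approximates "5 years ahead"; note the sample has year gaps for the Neutral Zone) and an illustrative output column name GDP_Next5_Mean:
import pandas as pd

# ensure chronological order within each country
df = df.sort_values(["Country", "Year"])

def next5_mean(s: pd.Series) -> pd.Series:
    # reverse, take a rolling mean of up to 5 rows, then shift by one so
    # each row sees the mean of the 5 rows *after* it in original order
    return s[::-1].rolling(5, min_periods=1).mean().shift(1)[::-1]

df["GDP_Next5_Mean"] = df.groupby("Country")["GDP Annual Growth"].transform(next5_mean)
# keep the value only where a discovery actually happened
df.loc[df["Field_Size_MM_Barrels"].isna(), "GDP_Next5_Mean"] = None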

Multiplying groupby elements for each group in survey

I am working on the Stack Overflow 2019 Survey data. Here is the Survey 2019 data.
There are lots of columns in that data.
I want to carry out this calculation ---> "Sum of Age1stCode" / "Number of people of that age".
Age1stCode is a column in the survey giving the age at which a respondent first wrote code. Age is a column giving the respondent's current age.
I have created a group according to "Age".
I just want to multiply each Age1stCode value by the number of people with that value and then sum them. For instance, for age 11 = (6x3)+(7x3)+(9x2)+.......(8x1). I want to do this for each age group, so at the end I want to achieve an output like the file I attached (Age 11.0 ----> 326 (it is just random for example), Age 12.0 ---> 468).
My goal is to calculate this ---> the sum of Age1stCode for each age group.
Here is the output that I want to work with (attached file).
df_grouped = df.groupby('Age').agg({'Age1stCode': 'sum'}).reset_index()
# divide by the number of respondents in each age group, not by the age itself
group_size = df.groupby('Age').size().reset_index(name='count')['count']
new_col = df_grouped['Age1stCode'] / group_size
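A quick check on a tiny invented sample (the values are made up purely to illustrate the sum/count arithmetic; the per-person ratio is just the group mean):
import pandas as pd

df = pd.DataFrame({'Age':        [11, 11, 11, 12, 12],
                   'Age1stCode': [6,  7,  9,  8,  10]})

grouped = df.groupby('Age')['Age1stCode'].agg(['sum', 'count'])
grouped['per_person'] = grouped['sum'] / grouped['count']  # same as .mean()
print(grouped)
#      sum  count  per_person
# Age
# 11    22      3    7.333333
# 12    18      2    9.000000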

Understanding Correlation Between Columns Pandas DataFrame

I have a dataset with the daily sales of two products for the first 10 days of their release. The dataframe below shows single items and dozens being sold per day for each product. It's believed that no dozens of a product were sold before a single item of that product had been sold. The two products (Period_ID) have an expected number of dozens sales (Prod_A_Doz and Prod_B_Doz).
import pandas as pd

d = {'Period_ID': ['A12']*10, 'Prod_A_Doz': [1.2]*10, 'Prod_B_Doz': [2.4]*10,
     'A_Singles': [0,0,0,1,1,2,2,3,3,4], 'B_Singles': [0,0,1,1,2,2,3,3,4,4],
     'A_Dozens': [0,0,0,0,0,0,0,1,1,1], 'B_Dozens': [0,0,0,0,0,0,1,1,2,2]}
df = pd.DataFrame(data=d)
QUESTION
I want to construct a descriptive analysis in which one of my questions is to figure out how many single items of each product were sold on average before a dozen was sold the 1st time, 2nd time, ..., 10th time.
Given that df.Period_ID.nunique() = 1568
Modifying the dataset to sales per day, as opposed to the cumulative sales above, and using Pankaj Joshi's solution with a small alteration:
print(f'Average number of single items before {index + 1} dozen = {df1.A_Singles[:val+1].mean():0.2f}')
d = {'Period_ID': ['A12']*10, 'Prod_A_Doz': [1.2]*10, 'Prod_B_Doz': [2.4]*10,
     'A_Singles': [0,0,0,1,0,1,0,1,0,1], 'B_Singles': [0,0,1,0,1,0,1,0,1,0],
     'A_Dozens': [0,0,0,0,0,0,0,1,0,0], 'B_Dozens': [0,0,0,0,0,0,1,0,1,0]}
df1 = pd.DataFrame(data=d)
# For product A
Average number of single items before 1 dozen = 0.38
# For product B
6
Average number of single items before 1 dozen = 0.43
8
Average number of single items before 2 dozen = 0.44, but I want this counted from the last dozens sale, so rather than 0.44 it should be 0.5.
The aim is, once I have the information for each Period_ID, to take the average across all df.Period_ID.nunique() (= 1568) periods and try to optimise the expected number of 'Dozens' sales for each product, given under the columns Prod_A_Doz and Prod_B_Doz.
I would appreciate all the help.
Here is how I would go about it:
d = {'Period_ID': ['A12']*10, 'Prod_A_Doz': [1.2]*10, 'Prod_B_Doz': [2.4]*10,
     'A_Singles': [0,0,0,1,1,2,2,3,3,4], 'B_Singles': [0,0,1,1,2,2,3,3,4,4],
     'A_Dozens': [0,0,0,0,0,0,0,1,1,1], 'B_Dozens': [0,0,0,0,0,0,1,1,2,2]}
df1 = pd.DataFrame(data=d)
for per_id in set(df1.Period_ID):
    print(per_id)
    df_temp = df1[df1.Period_ID == per_id]
    for index, val in enumerate(df_temp.index[df_temp.A_Dozens > 0]):
        print(val)
        print(f'Average number of single items before {index} dozen = {df_temp.A_Singles[:val].mean():0.2f}')
        # for product B, repeat with df_temp.B_Dozens / df_temp.B_Singles
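To count from the last dozens sale instead (the 0.5 the question asks for), here is a sketch that resets the window after each dozen; it uses the per-day data from the question (the helper name is illustrative), and reproduces the 0.38, 0.43 and 0.50 figures above:
import pandas as pd

# per-day (non-cumulative) sales, as in the question's modified dataset
df1 = pd.DataFrame({'A_Singles': [0,0,0,1,0,1,0,1,0,1],
                    'A_Dozens':  [0,0,0,0,0,0,0,1,0,0],
                    'B_Singles': [0,0,1,0,1,0,1,0,1,0],
                    'B_Dozens':  [0,0,0,0,0,0,1,0,1,0]})

def avg_singles_between_dozens(singles, dozens):
    prev = 0
    for n, pos in enumerate([i for i, v in enumerate(dozens) if v > 0], start=1):
        window = singles.iloc[prev:pos + 1]  # days since the previous dozen sale
        print(f'Average number of single items before dozen #{n} = {window.mean():0.2f}')
        prev = pos + 1

avg_singles_between_dozens(df1.A_Singles, df1.A_Dozens)  # 0.38
avg_singles_between_dozens(df1.B_Singles, df1.B_Dozens)  # 0.43, then 0.50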

SSRS: Summing Lookupset not working

I'm currently working on a report built from 3 different datasets. The report essentially calculates the input, output and losses of a given food production process.
The dataset "Spices" contains the quantity of spices used in a field named "Qty_Spice". The dataset "Meat" contains the quantity of meat used in a field named "Qty_Meat". The dataset "Finished" contains the quantity of finished product in a field named "Qty_Finished".
I'm currently trying to create a table where the amount of input (spice + meat) is compared against output (finished product), such that the table looks like this:
Sum of Inputs (kg) | Finished Product (kg) | Losses (kg)
                10 |                     8 |           2
                 8 |                     5 |           3
Total:          18 |                    13 |           5
What I'm currently doing is using LookupSet to get all inputs of both spices and meats (LookupSet instead of Lookup because there are many different types of meat and spice used), then using custom code named "SumLookup" to sum the quantities LookupSet returned.
The problem I'm having is that when I want the total sum of all inputs and all finished products (the bottom of the table) using "SumLookup", the table returns only the first weight it finds. In the example above, it would return 10, 8 and 2 as inputs, finished product and losses respectively.
Does anyone know how I should approach solving this?
Really appreciate any help.
Here is the custom code I used for SumLookUp:
Public Function SumLookup(ByVal items As Object()) As Decimal
    ' guard against an empty LookupSet result
    If items Is Nothing Then
        Return 0
    End If
    Dim suma As Decimal = 0
    For Each item As Decimal In items
        suma += item
    Next
    Return suma
End Function

How to compute the difference in monthly income for the same id

The dataframe below shows the monthly revenue of two shops (shop_id=11, shop_id=15) during the period of a few years:
import pandas as pd

data = {'shop_id': [11, 15, 15, 15, 11, 11],
        'month':   [1, 1, 2, 3, 2, 3],
        'year':    [2011, 2015, 2015, 2015, 2014, 2014],
        'revenue': [11000, 5000, 4500, 5500, 10000, 8000]}
df = pd.DataFrame(data)
df = df[['shop_id', 'month', 'year', 'revenue']]
display(df)
You can see that shop_id=11 has only one entry in 2011 (January) and shop_id=15 has a few entries in 2015 (January, February, March). Nevertheless, it's interesting to note that the first shop has a few more entries in 2014.
I'm trying to optimize a custom function (used along with .apply()) that creates a new feature called diff_revenue: this feature shows the change in revenue from the previous month for each shop.
I would like to offer some explanation of how some of the values in diff_revenue were generated:
The first cell's value is 0 (red) because there is no previous information for shop_id=11;
The 2nd cell is also 0 (orange), for the same reason: there is no previous information for shop_id=15;
The 3rd cell is 500 (green), because the change from this shop's last entry (January 2015) to the current cell's revenue (February 2015) is 500 Trumps;
The 5th cell is 1000 (dark blue), because the change from this shop's last entry (January 2011) to the current cell's revenue (February 2014) was 1000 Trumps.
I'm no expert in Pandas and was wondering if the Pandas gods knew a better way. The DataFrame I have to work with is quite large (over 1M observations) and my current approach is too slow. I'm looking for a faster alternative, or maybe something more readable.
You more or less want to use Series.diff on the 'revenue' column, but need to do a few additional things:
Sort to ensure your DataFrame is in chronological order (can undo this later)
Perform a groupby on 'shop_id' to do group level operations
Take the absolute value, since you don't want to distinguish between positive and negative
In terms of code:
# sort the values so they're in order when we perform a groupby
df = df.sort_values(by=['year', 'month'])
# perform a groupby on 'shop_id' and get the row-wise difference within each group
df['diff_revenue'] = df.groupby('shop_id')['revenue'].diff()
# fill NA as zero (no previous info), take absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].fillna(0).abs().astype('int')
# revert to original order
df = df.sort_index()
The resulting output:
   shop_id  month  year  revenue  diff_revenue
0       11      1  2011    11000             0
1       15      1  2015     5000             0
2       15      2  2015     4500           500
3       15      3  2015     5500          1000
4       11      2  2014    10000          1000
5       11      3  2014     8000          2000
Edit
A slightly less straightforward solution, but maybe a bit more performant:
# sort the values so they're chronological order by shop_id
df = df.sort_values(by=['shop_id', 'year', 'month'])
# take the row-wise difference ignoring changes in shop_id
df['diff_revenue'] = df['revenue'].diff()
# zero out locations where shop_id changes (no previous info)
df.loc[df['shop_id'] != df['shop_id'].shift(), 'diff_revenue'] = 0
# Take the absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].abs().astype('int')
# revert to original order
df = df.sort_index()
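Given the stated scale (over 1M rows), a rough, machine-dependent timing harness along these lines could compare the two approaches; the synthetic data below is invented purely for benchmarking, so treat any numbers as indicative only:
import numpy as np
import pandas as pd

# ~1.2M synthetic rows to mimic the stated scale (shapes invented)
rng = np.random.default_rng(0)
n = 1_200_000
big = pd.DataFrame({'shop_id': rng.integers(0, 1000, n),
                    'month':   rng.integers(1, 13, n),
                    'year':    rng.integers(2010, 2020, n),
                    'revenue': rng.integers(1_000, 20_000, n)})

def groupby_diff(d):
    # first approach: sort chronologically, then per-shop diff
    d = d.sort_values(by=['year', 'month'])
    out = d.groupby('shop_id')['revenue'].diff().fillna(0).abs().astype('int')
    return out.sort_index()

def shift_mask(d):
    # second approach: sort by shop then time, diff once, zero out shop changes
    d = d.sort_values(by=['shop_id', 'year', 'month'])
    out = d['revenue'].diff()
    out[d['shop_id'] != d['shop_id'].shift()] = 0
    return out.abs().astype('int').sort_index()

# e.g. in IPython: %timeit groupby_diff(big)   then   %timeit shift_mask(big)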