I have 4 columns: Country, Year, GDP Annual Growth and Field Size in MM Barrels.
I am looking for a way to create a loop that generates the mean GDP growth over the 5 years following the discovery of a field ("Field_Size_MM_Barrels"). Example: in 1961 a discovery was made in Algeria and its size is 2462. What is the average GDP annual growth over the following 5 years (1962-1966)?
In this case, NaN marks years in which no discoveries were made. I would like the loop to write each mean value into a column next to Field Size. Any idea how to do that?
Country,Year,GDP Annual Growth,Field_Size_MM_Barrels
Algeria,1961,-13.605441,2462.0
Algeria,1962,-19.685042,2413.0
Algeria,1963,34.313729,NaN
Algeria,1964,5.839413,NaN
Algeria,1965,6.206898,500.0
Yemen,2016,-13.621458,NaN
Yemen,2017,-5.942320,NaN
Yemen,2018,-2.701475,NaN
Divided Neutral Zone: Kuwait/Saudi Arabia,1963,NaN,832.0
Divided Neutral Zone: Kuwait/Saudi Arabia,1967,NaN,1566.0
# read in with
df = pd.read_clipboard(sep=',')
If you could include a sample of the dataframe (say, the first 20 rows), it would help in answering and testing answers. Here's a possible starting point:
# create a list for the average GDP values (one entry per row)
average = []
# go over all rows in df
for row_id in range(len(df)):
    test = df.iloc[row_id]["Field_Size_MM_Barrels"]
    # only average over the following years when a discovery was made
    if pd.notna(test):
        # rows covering the 5 years after the discovery year
        row_list = list(range(row_id + 1, min(row_id + 6, len(df))))
        average.append(df["GDP Annual Growth"].iloc[row_list].mean())
    else:
        average.append(float("nan"))
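To actually write these means into a column next to Field Size, a vectorized sketch is shown below. It assumes the column names from the sample above, that each country's rows are consecutive years, and it groups by Country so a 5-year window never runs into another country's rows; the new column name Mean_GDP_Next_5Y is just illustrative.
import pandas as pd

def mean_next_5_years(group):
    gdp = group["GDP Annual Growth"]
    # reversed rolling mean over the 5 rows after each row (current row excluded)
    next5 = gdp.shift(-1)[::-1].rolling(5, min_periods=1).mean()[::-1]
    # keep the value only for years in which a discovery was made
    return next5.where(group["Field_Size_MM_Barrels"].notna())

df["Mean_GDP_Next_5Y"] = df.groupby("Country", group_keys=False).apply(mean_next_5_years)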
I have built a financial model in python where I can enter sales and profit for x years in y scenarios - a base scenario plus however many I add.
Annual figures are uploaded per scenario in my first dataframe (e.g. if x = 5 beginning in 2022 then the base scenario sales column would show figures for 2022, 2023, 2024, 2025 and 2026)
I then use monthly weightings to create a monthly phased sales forecast in a new dataframe, with columns titled Base sales 2022, Base sales 2023, Base sales 2024, etc., and figures shown monthly.
I want to show these figures in a single series, so that I have a single time series for base sales from Jan 2022 to Dec 2026 for charting and analysis purposes.
I've managed to get this to work by creating a list and manually adding the names of each column I want to include, but this will not work if I have a different number of scenarios or years, so I am trying to automate the process and can't find a way to do it.
I don't want to share my main model's code, but I have created a mini model below that does a similar thing. It doesn't work: although it generates most of the output I want (three lists are requested: listA0, listA1, listA2), the lists clearly aren't created because they aren't callable. Also, I really need all the text on a single line rather than split over multiple lines (or perhaps I should use list append for each subsequent item). Any help gratefully received.
Below is the code I have tried:
import pandas as pd

# Create the list of scenarios and capture the number for use later
Scenlist = ["Bad", "Very bad", "Terrible"]
Scen_number = 3
# Create the list of years under assessment and count the number of years
Years = [2020, 2021, 2022]
Totyrs = len(Years)
# Create the dataframe dprofit and, for example purposes, create the columns, all showing two datapoints 10 and 10
dprofit = pd.DataFrame()
a = 0
b = 0
# This creates column names in the format Bad profit 2020, Bad profit 2021, etc.
while a < Scen_number:
    while b < Totyrs:
        dprofit[Scenlist[a] + " profit " + str(Years[b])] = [10, 10]
        b = b + 1
    b = 0
    a = a + 1
# Now that the columns have been created, print the table
print(dprofit)
#Now create the new table profit2 which will be used to capture the three columns (bad, very bad and terrible) for the full time period by listing the years one after another
dprofit2=pd.DataFrame()
#Create the output to recall the columns from dprofit to combine into 3 lists listA0, list A1 and list A2
a = 0
b = 0
Totyrs = len(Years)
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]")
        b = b + 1
    b = 0
    a = a + 1
print(listA0)
#print(listA0) will not run; it raises NameError: name 'listA0' is not defined. Did you mean: 'list'?
To fix the printing, you could set the end parameter to end=''.
results = []  # collect the (scenario, year) pairs as the strings are printed
a = 0
b = 0
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        results.append([Scenlist[a], Years[b]])
        b = b + 1
    print()
    b = 0
    a = a + 1
Output:
listA0=dprofit[Bad profit 2020]+dprofit[Bad profit 2021]+dprofit[Bad profit 2022]
listA1=dprofit[Very bad profit 2020]+dprofit[Very bad profit 2021]+dprofit[Very bad profit 2022]
listA2=dprofit[Terrible profit 2020]+dprofit[Terrible profit 2021]+dprofit[Terrible profit 2022]
To obtain a list or pd.DataFrame of the columns, you could simply filter() for the required columns. No loop required.
listA0 = dprofit.filter(regex="Bad profit", axis=1)
listA1 = dprofit.filter(regex="Very bad profit", axis=1)
listA2 = dprofit.filter(regex="Terrible profit", axis=1)
print(listA1)
Output for listA1:
   Very bad profit 2020  Very bad profit 2021  Very bad profit 2022
0                    10                    10                    10
1                    10                    10                    10
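If the underlying goal is one continuous series per scenario rather than printed assignment statements, a possible sketch is to collect each scenario's yearly columns and stack them end to end with pd.concat; this works for any number of scenarios or years. The name series_by_scenario is just illustrative, and the sketch relies on the column-naming pattern used above.
series_by_scenario = {}
for scen in Scenlist:
    # this scenario's yearly columns, in year order, e.g. "Bad profit 2020", "Bad profit 2021", ...
    cols = [f"{scen} profit {yr}" for yr in Years]
    # stack the yearly columns end to end into one long series
    series_by_scenario[scen] = pd.concat([dprofit[c] for c in cols], ignore_index=True)

print(series_by_scenario["Bad"])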
I have a table like this that captures the experiment data:
treatment metric values    control metric values
1                          6
2                          5
3                          7
...                        ...
I want to calculate the p-value for the experiment in Presto using SQL. I can take the average of the metric values for both the treatment and control groups to compare, but I need the p-value to see whether the results are statistically significant.
Given your data format, assuming equal population sizes, all users in the experiment are in the data set, etc:
SELECT
  NORMAL_CDF(
    ABS(AVG("treatment metric values") - AVG("control metric values")),
    SQRT(VAR_SAMP("treatment metric values") + VAR_SAMP("control metric values")),
    0
  ) AS p_value
FROM experiment_data  -- placeholder: replace with your actual table name
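As a quick sanity check of the result outside Presto, a two-sample test can be run on the raw metric values in Python; the sketch below uses Welch's t-test from scipy and the toy values from the table above, with the column names as placeholders.
import pandas as pd
from scipy import stats

# placeholder frame mirroring the table above
df = pd.DataFrame({
    "treatment metric values": [1, 2, 3],
    "control metric values": [6, 5, 7],
})

t_stat, p_value = stats.ttest_ind(
    df["treatment metric values"],
    df["control metric values"],
    equal_var=False,  # Welch's t-test: no equal-variance assumption
)
print(p_value)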
I have a dataframe that contains monthly log returns of different funds. The dataframe consists of more than 100 columns and 500 rows. For dates where a return is not available, an np.nan value is reported. I am trying to calculate the annualized return for the last 3 years for each fund, and I am using the function below:
def calculate_3Y_return(data):
    data_cumsum = data.iloc[-36:].cumsum()
    return_3Y = data_cumsum.iloc[-1] * (1 / 3)
    return return_3Y
For funds that do not have enough data points (their track record is shorter than 36 months) I would like my function to return NaN.
Can you help me add an if statement to my function so that it calculates the 3-year annualized return for funds that have 36 data points available and returns NaN for funds that have fewer than 36 data points (these funds were launched recently, and therefore the number of available data points is less than 36)?
I already tried to use dataframe.count(), but it didn't help.
Many thanks
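A minimal sketch of one way to do this, assuming the log returns sit in a DataFrame called df with one column per fund, and counting only non-NaN values in the last 36 rows:
import numpy as np

def calculate_3Y_return(data):
    last_36 = data.iloc[-36:]
    # require a full 36-month track record, otherwise return NaN
    if last_36.count() < 36:
        return np.nan
    # sum of monthly log returns over 3 years, annualized
    return last_36.sum() * (1 / 3)

# apply column by column (each column is one fund)
returns_3Y = df.apply(calculate_3Y_return)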
I am trying to do an analysis in which I create two similar samples based on three different attributes. I want to create these samples first and then do the analysis to see which of the two samples is better. The categorical variables are sales_group, age_group, and country, so I want to build both samples such that the proportions of country, age, and sales are similar in both samples.
For example: Sample A and B have following variables in it:
Id Country Age Sales
The proportion of Country in Sample A is:
USA- 58%
UK- 22%
India-8%
France- 6%
Germany- 6%
The proportion of country in Sample B is:
India- 42%
UK- 36%
USA-12%
France-3%
Germany- 5%
The same goes for the other categorical variables: age_group and sales_group.
Thanks in advance for help
You do not need to establish a special procedure for sampling, since the sample proportion is an unbiased estimate of the population proportion. If you have, say, more than 1000 observations and your samples contain more than, say, 30 rows, the estimates will be quite accurate (Central Limit Theorem).
You can see it in the simulation below:
set.seed(123)
n <- 10000 # Amount of rows in the source data frame
df <- data.frame(sales_group = sample(LETTERS[1:4], n, replace = TRUE),
                 age_group = sample(c("old", "young"), n, replace = TRUE),
                 country = sample(c("USA", "UK", "India", "France", "Germany"), n, replace = TRUE),
                 amount = abs(100 * rnorm(n)))
s <- 100 # Amount of sampled rows
sampleA <- df[sample(nrow(df), s), ]
sampleB <- df[sample(nrow(df), s), ]
table(sampleA$sales_group)
# A B C D
# 23 22 32 23
table(sampleB$sales_group)
# A B C D
# 25 22 28 25
DISCLAIMER: however, if you have some very small or very large proportions and too few samples, you will need to use more advanced procedures such as Laplace smoothing.
I have a dataset that indicates date & time in 5-digit format: ddd + hm
The ddd part counts days starting from 1 Jan 2009. Since the data was collected over a two-year period from then, its [min, max] range is [1, 365 x 2 = 730].
Data is observed at 30-minute intervals, so each 24-hour day is split into at most 48 slots; the [min, max] range for hm is [1, 48].
The daycode.csv file maps the ddd part of the daycode to its matching date, and the hm part to its matching time.
I think I agreed not to show the dataset itself (it is from ISSDA), so I will just describe that a daycode in the File1.txt file reads like '63317'.
This link gave me a glimpse of how to approach the problem, and I was in the middle of putting this code together, which of course won't work at this point:
consume = pd.read_csv("data/File1.txt", sep= ' ', encoding = "utf-8", names =['meter', 'daycode', 'val'])
df1= pd.read_csv("data/daycode.csv", encoding = "cp1252", names =['code', 'print'])
test = consume[consume['meter']==1048]
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units (there are thousands of them) have been observed for the full length, but 730 x 48 is too large a combination to lay out in Excel by hand. To be honest it's not an elegant approach, but I tried dragging the cells and it doesn't quite get there.
If I could read the first 3 digits of the column values and match them against a column of another file, and the last 2 digits against another column, then combine the two... is there a way?
For those last two points, you can just do something like this:
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
For joining the two dataframes:
df3 = df.merge(df2,left_on=['first_3_digits','last_2_digits'],right_on=['col1_df2','col2_df2'],how='left')
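As a quick illustration with the example daycode from the question (the column names col1, col1_df2, col2_df2 and the lookup label are placeholders, as above):
import pandas as pd

# toy frames: df holds the raw daycodes, df2 stands in for the daycode.csv lookup
df = pd.DataFrame({"col1": [63317]})
df2 = pd.DataFrame({"col1_df2": ["633"], "col2_df2": ["17"],
                    "print": ["<date/time label for day 633, slot 17>"]})

df["first_3_digits"] = df["col1"].map(lambda x: str(x)[:3])
df["last_2_digits"] = df["col1"].map(lambda x: str(x)[-2:])

df3 = df.merge(df2,
               left_on=["first_3_digits", "last_2_digits"],
               right_on=["col1_df2", "col2_df2"],
               how="left")
print(df3)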