How to calculate percentage changes across 2 columns in a dataframe using pct_change in Python - pandas

I have a dataframe and want to use the pct_change method to calculate the % change between only 2 selected columns, B and C, and put the output into a new column. The code below doesn't seem to work. Can anyone help me?
df2 = pd.DataFrame(np.random.randint(0,50,size=(100, 4)), columns=list('ABCD'))
df2['new'] = df2.pct_change(axis=1)['B']['C']

Try:
df2['new'] = df2[['B','C']].pct_change(axis=1)['C']

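pct_change(axis=1) returns the percentage change of every column relative to the column to its left, so you can select the required column and assign it to a new one. This gives the change from B to C only because B sits immediately before C in the dataframe.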
df2['new'] = df2.pct_change(axis=1)['C']
A B C D new
0 29 4 29 5 6.250000
1 14 35 2 40 -0.942857
2 5 18 31 10 0.722222
3 17 10 42 41 3.200000
4 24 48 47 35 -0.020833

IIUC, you can just do the following:
df2['new'] = (df2['C']-df2['B'])/df2['B']
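The two approaches above should agree. A minimal sketch, assuming a small hand-made frame (the values are the first two rows of the sample output shown earlier) instead of random data:

```python
import numpy as np
import pandas as pd

# Small fixed frame so the result is reproducible (values are illustrative).
df2 = pd.DataFrame({"A": [29, 14], "B": [4, 35], "C": [29, 2], "D": [5, 40]})

# Restrict to B and C first, then take the row-wise change and keep column C.
via_pct_change = df2[["B", "C"]].pct_change(axis=1)["C"]

# The same quantity written out by hand: (C - B) / B.
via_arithmetic = (df2["C"] - df2["B"]) / df2["B"]

df2["new"] = via_arithmetic
print(df2)
```

Selecting the two columns before calling pct_change keeps the result independent of the column order in the full frame.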

Related

Print Pandas Unique Rows by Column Condition

I am trying to print the rows whereby a data condition is met in a pandas DF based on the unique values in the DF. For example, I have data that looks like this:
DF:
site temp month day
A 15 7 18
A 11 6 12
A 22 9 3
B 9 4 23
B 3 2 11
B -1 5 18
I need the result to print the rows where the max in the 'temp' column occurs such as this for the final result:
A 15
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site temp month day
A 22 9 3
B 9 4 23
thank you!
A possible solution:
df.groupby('site', as_index=False)['temp'].max()
Output:
site temp
0 A 22
1 B 9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
df.groupby('site').max()
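Output (note that the maximum is taken for each column independently, so month and day need not come from the row with the highest temp):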
temp month day
site
A 22 9 18
B 9 5 23
Let us do sort_values + drop_duplicates
df = df.sort_values('temp',ascending=False).drop_duplicates('site')
Out[190]:
site temp month day
2 A 22 9 3
3 B 9 4 23
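Another way to keep the full row for each site's maximum is idxmax. This is a sketch that rebuilds the question's sample data by hand; the name hottest is only illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "temp": [15, 11, 22, 9, 3, -1],
    "month": [7, 6, 9, 4, 2, 5],
    "day": [18, 12, 3, 23, 11, 18],
})

# idxmax returns, per site, the index label of the row with the highest temp;
# .loc then pulls those complete rows, so month and day stay consistent.
hottest = df.loc[df.groupby("site")["temp"].idxmax()]
print(hottest)
```

If several rows tie for the maximum within a site, idxmax keeps only the first of them.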

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe with currently 22 rows
index value
0 23
1 22
2 19
...
21 20
To this dataframe, I want to add 78 rows to make the dataframe exactly 100 rows. So I need to fill loc[22:99], but with a certain value, let's say 100.
I tried something like this
uncon_dstn_2021['balance'].loc[22:99] = 100
but did not work. Any idea?
You can do reindex
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
balance
0 1
1 14
2 11
3 11
4 10
.. ...
96 100
97 100
98 100
99 100
[100 rows x 1 columns]
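The reindex idea can also be written against the target length directly. A minimal sketch, assuming a default integer index and using a 3-row stand-in for the 22-row frame (the names df and padded are illustrative):

```python
import pandas as pd

# Stand-in for the original frame; only the first few values are known.
df = pd.DataFrame({"balance": [23, 22, 19]})

# Reindex to labels 0..99: existing rows are kept, new labels get fill_value.
padded = df.reindex(range(100), fill_value=100)
print(padded.tail())
```

As far as I can tell, the original attempt fails because assigning to a .loc slice only touches labels that already exist, so it never adds rows.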

Pandas with a condition select a value from a column and multiply by scalar in new column, row by row

A value in 'Target_Labels' is either 0.0,1.0,2.0 in float64.
Based on this value, I would like to look up a value in one of three columns 'B365A','B365D','B365H' and multiply this value by 10 in a new column. This operation needs to be row wise throughout the entire DataFrame.
I have tried many combinations but nothing seems to work...
final['amount'] = final['Target_Labels'].apply((lambda x: 'B365A' * 10 if x==0.0 else ('B365D' * 10 if x ==1 else 'B365H' * 10))
def prod(x, var1, var2, var3, var4):
    if (x[var4])==0:
        x[var3]*10
    elif (x[var4])==1:
        x[var1]*10
    else:
        x[var2]*10
    return x
final['montant'] = final.apply(lambda x: prod(x, 'B365D', 'B365H','B365A', 'Target_Labels'), axis=1)
I'm new to Pandas and any help is welcome...
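Use NumPy indexing to pick one cell per row. The example below uses labels 1, 2, 3, hence the - 1; with 0-based labels as in the question, cast them to int and drop the subtraction:
array = final.values
row = range(len(final))
col = final['Target_Labels'] - 1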
>>> final
B365A B365D B365H Target_Labels
0 11 12 13 1
1 11 12 13 2
2 11 12 13 3
>>> final['amount'] = final.values[(range(len(final)),
final['Target_Labels'] - 1)] * 10
>>> final
B365A B365D B365H Target_Labels amount
0 11 12 13 1 110
1 11 12 13 2 120
2 11 12 13 3 130
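For the labels described in the question (0.0, 1.0, 2.0 as float64), the same indexing idea might look like the sketch below. The odds values are made up, and the mapping 0 -> B365A, 1 -> B365D, 2 -> B365H is an assumption taken from the lambda in the question:

```python
import numpy as np
import pandas as pd

# Made-up odds; Target_Labels is float64 and 0-based as in the question.
final = pd.DataFrame({
    "B365A": [1.5, 2.0, 3.0],
    "B365D": [2.5, 3.0, 4.0],
    "B365H": [3.5, 4.0, 5.0],
    "Target_Labels": [0.0, 1.0, 2.0],
})

# Column order here defines which label maps to which column (assumed).
odds = final[["B365A", "B365D", "B365H"]].to_numpy()
rows = np.arange(len(final))
cols = final["Target_Labels"].astype(int).to_numpy()

# Pick one cell per row (row i, column cols[i]) and multiply by 10.
final["amount"] = odds[rows, cols] * 10
print(final)
```

Selecting the three odds columns explicitly keeps the lookup independent of where Target_Labels sits in the frame.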

Normalisation or scaling of a column in pyspark

I want to scale a particular column in pyspark. In this case I want to scale the results column. My data frame looks like -
id age results
1 28 98
2 27 12
3 28 99
4 28 5
5 27 54
I have done so far -
df = spark.createDataFrame(
[(1,28,98),(2,27,12),(3,28,99),(4,28,5),(5,27,54)],
("id","age","results"))
minmax_result = df.groupBy("id").agg(min("results").alias("min_results"),max("results").alias("max_results"))
final_df = minmax_result.join(df,["id"]).select(
    ((col("results") - col("min_results")) / col("min_results")).alias("scaled_results"))
final_df.show()
it gives me like -
id age results scaled_results
1 28 98 null
2 27 12 null
3 28 99 null
4 28 5 null
5 27 54 null
I'm assuming you're planning to scale the column across all ids, so you won't be needing the groupby operation, unless you're going the UDF route. I'd suggest going with the following:
# min_val / max_val avoid shadowing the min and max functions used in the question
min_val = df.agg({"results": "min"}).collect()[0][0]
max_val = df.agg({"results": "max"}).collect()[0][0]
df_scaled = df.withColumn('scaled_results', (col('results') - min_val) / max_val)
I presume you're dividing each cell by the min value instead of the max value by mistake, but that might be your intended use case as well.
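You can also use StandardScaler from pyspark.ml.feature, something like the following. It expects a vector column as input, so new_df here stands for a DataFrame whose features column has already been assembled (for example with VectorAssembler):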
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
withStd=True, withMean=False)
scalerModel = scaler.fit(new_df)
scaledData = scalerModel.transform(new_df)
Reference: https://spark.apache.org/docs/latest/mllib-feature-extraction.html

How to split numbers in pandas column into deciles?

I have a column in a pandas dataset of random values ranging between 100 and 500.
I need to create a new column 'deciles' from it, like a ranking, with 20 deciles in total. Each value should be assigned a rank number out of 20.
10 to 20 - is the first decile, number 1
20 to 30 - is the second decile, number 2
x = np.random.randint(100,501,size=(1000)) # column of 1000 rows with values ranging between 100 and 500
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map( lambda x: int(x/20) )
df.head()
Use integer division by 10:
df = pd.DataFrame({
'credit_score':[4,15,24,55,77,81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print (df)
credit_score credit_decile_rank
0 4 0
1 15 1
2 24 2
3 55 5
4 77 7
5 81 8
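Integer division gives fixed-width bins. If what is wanted is rank-based deciles instead (each bin holding roughly the same number of rows), pd.qcut is a different technique that may fit. A sketch under that assumption, with evenly spaced made-up scores so the bins are easy to check:

```python
import numpy as np
import pandas as pd

# 100 distinct made-up scores; real data would be the credit_score column.
scores = pd.DataFrame({"credit_score": np.arange(100, 200)})

# qcut splits by quantiles; labels=False returns bin codes 0..9, so add 1
# to get ranks 1..10. Use q=20 for twenty equally populated bins.
scores["credit_decile_rank"] = (
    pd.qcut(scores["credit_score"], q=10, labels=False) + 1
)
print(scores.head())
```

With many repeated values the quantile edges can coincide, in which case qcut raises an error unless duplicates="drop" is passed.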