Normalisation or scaling of a column in pyspark - apache-spark-sql

I want to scale a particular column in pyspark. In this case i want to do scaling in results column.My data frame looks like -
id age results
1 28 98
2 27 12
3 28 99
4 28 5
5 27 54
I have done so far -
df = spark.createDataFrame(
[(1,28,98),(2,27,12),(3,28,99),(4,28,5),(5,27,54)],
("id","age","results"))
minmax_result = df.groupBy("id").agg(min("results").alias("min_results"),max("results").alias("max_results))
final_df = minmax_result.join(df,["id"]).select(
((col("results") - col("min_results")) / col("min_results"))).alias("scaled_results"))
final_df.show()
it gives me like -
id age results scaled_results
1 28 98 null
2 27 12 null
3 28 99 null
4 28 5 null
5 27 54 null

I'm assuming you're planning to scale the column across all ids, so you won't be needing the groupby operation, unless you're going the UDF route. I'd suggest going with the following:
min = df.agg({"results": "min"}).collect()[0][0]
max = df.agg({"results": "max"}).collect()[0][0]
df_scaled = df_test.withColumn('scaled_results', (col('results') - min)/max)
I presume you're dividing each cell by the min value instead of the max value by mistake, but that might the use case as well.

you can use StandardScaler function in Pyspark Mllib something like this :
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
withStd=True, withMean=False)
scalerModel = scaler.fit(new_df)
scaledData = scalerModel.transform(new_df)
Refer : https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Upvote if works

Related

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe with currently 22 rows
index value
0 23
1 22
2 19
...
21 20
to this dataframe, i want to add 72 rows to make the dataframe exactly 100 rows. So i need to fill loc[22:99] but with a certain value, let's say 100.
I tried something like this
uncon_dstn_2021['balance'].loc[22:99] = 100
but did not work. Any idea?
You can do reindex
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
balance
0 1
1 14
2 11
3 11
4 10
.. ...
96 100
97 100
98 100
99 100
[100 rows x 1 columns]

Using apply for multiple columns

I need to create 2 new columns based one existing 2 columns. I am trying to do it using 1 single apply function instead of 2 apply functions separately.
Initial Df for example is as follows:
ID1 ID2
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
Next I try to create 2 new columns using the below method:
def funct(row):
list1 = row.values
print(list1[0])
return row
df[['s1','s2']] = df[['ID1',"ID2"]].apply(lambda row: funct(row))
The issue is that I want to access the values individually which I am unable to do so . Here i tried converting into list but when I do list[0] i get
1
11
How to access 1 and 11 above? how should I index to access individual series value when I send two series together using apply?
NOTE: the content of funct() is just returning the same now because I still dont know how to access the values inorder to do something
add a parameter axis=1 to your apply function, like
import pandas as pd
from io import StringIO
s = """
,ID1,ID2
0,1,11
1,2,12
2,3,13
3,4,14
4,5,15
5,6,16
6,7,17
7,8,18
8,9,19
9,10,20
"""
df = pd.read_csv(StringIO(s),index_col=0)
def funct(row):
# return row
# update the answer
return pd.Series([row.ID1 + 100, row.ID2 + 20])
df[['s1','s2']] = df[['ID1',"ID2"]].apply(funct, axis=1)

Python Dataframe column operation using lambda function [duplicate]

I'm trying to multiply two existing columns in a pandas Dataframe (orders_df): Prices (stock close price) and Amount (stock quantities) and add the calculation to a new column called Value. For some reason when I run this code, all the rows under the Value column are positive numbers, while some of the rows should be negative. Under the Action column in the DataFrame there are seven rows with the 'Sell' string and seven with the 'Buy' string.
for i in orders_df.Action:
if i == 'Sell':
orders_df['Value'] = orders_df.Prices*orders_df.Amount
elif i == 'Buy':
orders_df['Value'] = -orders_df.Prices*orders_df.Amount)
Please let me know what i'm doing wrong !
I think an elegant solution is to use the where method (also see the API docs):
In [37]: values = df.Prices * df.Amount
In [38]: df['Values'] = values.where(df.Action == 'Sell', other=-values)
In [39]: df
Out[39]:
Prices Amount Action Values
0 3 57 Sell 171
1 89 42 Sell 3738
2 45 70 Buy -3150
3 6 43 Sell 258
4 60 47 Sell 2820
5 19 16 Buy -304
6 56 89 Sell 4984
7 3 28 Buy -84
8 56 69 Sell 3864
9 90 49 Buy -4410
Further more this should be the fastest solution.
You can use the DataFrame apply method:
order_df['Value'] = order_df.apply(lambda row: (row['Prices']*row['Amount']
if row['Action']=='Sell'
else -row['Prices']*row['Amount']),
axis=1)
It is usually faster to use these methods rather than over for loops.
If we're willing to sacrifice the succinctness of Hayden's solution, one could also do something like this:
In [22]: orders_df['C'] = orders_df.Action.apply(
lambda x: (1 if x == 'Sell' else -1))
In [23]: orders_df # New column C represents the sign of the transaction
Out[23]:
Prices Amount Action C
0 3 57 Sell 1
1 89 42 Sell 1
2 45 70 Buy -1
3 6 43 Sell 1
4 60 47 Sell 1
5 19 16 Buy -1
6 56 89 Sell 1
7 3 28 Buy -1
8 56 69 Sell 1
9 90 49 Buy -1
Now we have eliminated the need for the if statement. Using DataFrame.apply(), we also do away with the for loop. As Hayden noted, vectorized operations are always faster.
In [24]: orders_df['Value'] = orders_df.Prices * orders_df.Amount * orders_df.C
In [25]: orders_df # The resulting dataframe
Out[25]:
Prices Amount Action C Value
0 3 57 Sell 1 171
1 89 42 Sell 1 3738
2 45 70 Buy -1 -3150
3 6 43 Sell 1 258
4 60 47 Sell 1 2820
5 19 16 Buy -1 -304
6 56 89 Sell 1 4984
7 3 28 Buy -1 -84
8 56 69 Sell 1 3864
9 90 49 Buy -1 -4410
This solution takes two lines of code instead of one, but is a bit easier to read. I suspect that the computational costs are similar as well.
Since this question came up again, I think a good clean approach is using assign.
The code is quite expressive and self-describing:
df = df.assign(Value = lambda x: x.Prices * x.Amount * x.Action.replace({'Buy' : 1, 'Sell' : -1}))
To make things neat, I take Hayden's solution but make a small function out of it.
def create_value(row):
if row['Action'] == 'Sell':
return row['Prices'] * row['Amount']
else:
return -row['Prices']*row['Amount']
so that when we want to apply the function to our dataframe, we can do..
df['Value'] = df.apply(lambda row: create_value(row), axis=1)
...and any modifications only need to occur in the small function itself.
Concise, Readable, and Neat!
Good solution from bmu. I think it's more readable to put the values inside the parentheses vs outside.
df['Values'] = np.where(df.Action == 'Sell',
df.Prices*df.Amount,
-df.Prices*df.Amount)
Using some pandas built in functions.
df['Values'] = np.where(df.Action.eq('Sell'),
df.Prices.mul(df.Amount),
-df.Prices.mul(df.Amount))
For me, this is the clearest and most intuitive:
values = []
for action in ['Sell','Buy']:
amounts = orders_df['Amounts'][orders_df['Action'==action]].values
if action == 'Sell':
prices = orders_df['Prices'][orders_df['Action'==action]].values
else:
prices = -1*orders_df['Prices'][orders_df['Action'==action]].values
values += list(amounts*prices)
orders_df['Values'] = values
The .values method returns a numpy array allowing you to easily multiply element-wise and then you can cumulatively generate a list by 'adding' to it.
First, multiply the columns Prices and Amount. Afterwards use mask to negate the values if the condition is True:
df.assign(
Values=(df["Prices"] * df["Amount"]).mask(df["Action"] == "Buy", lambda x: -x)
)

How to run assembled sample data

I have a pd df assembled from various samples that I randomly picked. Now, I want to run 10,000 times and get mean values for column ['MP_learning'] and ['LCC_saving'] for each row.
How should I write the code?
I tried
output=np.mean(df), but it didn't work.
PC EL MP_Learning LCC _saving
0 1 0 24 95
1 1 1 35 67
2 1 2 12 23
3 1 3 23 45
4 2 0 36 67
5 2 1 74 10
6 2 2 80 23
np.random.seed()
output=[]
for i in range (10000):
output=np.mean(df)
output
For your code, you did not post the entire code. Thus, I don't know where the data come from. However, I replicated something similar and here is the solution. For you loop code though, you suppose to append to output. Use only one of those two lines in the "for" loop code, unless you need them both.
import pandas as pd
import numpy as np
df =\
pd.DataFrame([[1,0,24,95],
[1,1,35,67],
[1,2,12,23],
[1,3,23,45],
[2,0,36,67],
[2,1,74,10],
[2,2,80,23]],
columns = ["PC","EL","MP_Learning","LCC_saving"],
index = [0,1,2,3,4,5,6]
).T
output = []
for i in range (10000):
# Use the line below to get mean for both column
output.append(np.mean([df.loc["MP_Learning"],df.loc["LCC_saving"]]))
# Use the line below to get mean for one column
output.append(np.mean(df.loc["MP_Learning"]))
print(output)

how to calculate percentage changes across 2 columns in a dataframe using pct_change in Python

I have a dataframe and want to use pct_chg method to calculate the % change between only 2 of the selected columns, B and C, and put the output into a new column. the below code doesnt seem to work. can anyone help me?
df2 = pd.DataFrame(np.random.randint(0,50,size=(100, 4)), columns=list('ABCD'))
df2['new'] = df2.pct_change(axis=1)['B']['C']
Try:
df2['new'] = df2[['B','C']].pct_change(axis=1)['C']
pct_change returns pct_change across all the columns, you can select the required column and assign to a new variable.
df2['new'] = df2.pct_change(axis=1)['C']
A B C D new
0 29 4 29 5 6.250000
1 14 35 2 40 -0.942857
2 5 18 31 10 0.722222
3 17 10 42 41 3.200000
4 24 48 47 35 -0.020833
IIUC, you can just do the following:
df2['new'] = (df2['C']-df2['B'])/df2['B']