In my DataFrame, I want to compute the arccos, arcsin, and arctan of a column of angles.
For example:
Angle A    arccos A    arcsin A    arctan A
30
15
45
60
I want to compute these for each angle in column A.
Just import pyspark.sql.functions, where you will find acos(), asin(), and atan().
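For instance, here is a minimal sketch, assuming a Spark DataFrame df with a numeric column 'A' (the result column names are made up for illustration):
import pyspark.sql.functions as F

# Apply the inverse trig functions column-wise. Note that acos/asin are only
# defined for inputs in [-1, 1], so out-of-range values produce NaN.
df = (df
      .withColumn('arccos_A', F.acos('A'))
      .withColumn('arcsin_A', F.asin('A'))
      .withColumn('arctan_A', F.atan('A')))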
I would like to calculate the rolling exponentially weighted mean with df.rolling().mean(), but I get stuck at win_type='exponential'.
I have tried other win_types such as 'gaussian', which works fine, but I think 'exponential' must be handled a little differently.
dfTemp.rolling(window=21, min_periods=10, win_type='gaussian').mean(std=1)
# works fine
but when it comes to 'exponential',
dfTemp.rolling(window=21, min_periods=10, win_type='exponential').mean(tau=10)
# ValueError: The 'exponential' window needs one or more parameters -- pass a tuple.
How do I use win_type='exponential'? Thanks~~~
I faced the same issue and asked it on the Russian SO:
Got the following answer:
x.rolling(window=(2,10), min_periods=1, win_type='exponential').mean(std=0.1)
You should pass the tau value directly in the window=(2, 10) parameter, where 10 is the value of tau.
I hope it helps! Thanks to #MaxU
You can easily implement any kind of window by defining your kernel function.
Here's an example for a backward-looking exponential average:
import pandas as pd
import numpy as np

# Kernel function (backward-looking exponential)
def K(x):
    return np.exp(-np.abs(x)) * np.where(x <= 0, 1, 0)

# Exponential average function
def exp_average(values):
    N = len(values)
    exp_weights = list(map(K, np.arange(-N, 0) / N))
    return values.dot(exp_weights) / N

# Create a sample DataFrame
df = pd.DataFrame({
    'date': [pd.Timestamp(2020, 1, 1)] * 50 + [pd.Timestamp(2020, 1, 2)] * 50,
    'x': np.random.randn(100)
})

# Finally, compute the exponential moving average using `rolling` and `apply`
df['mu'] = df.groupby(['date'])['x'].rolling(5).apply(exp_average, raw=True).values
df.head(10)
Notice that, if N is fixed, you can significantly reduce the execution time by keeping the weights constant:
N = 10
exp_weights = list(map(K, np.arange(-N, 0) / N))

def exp_average(values):
    return values.dot(exp_weights) / N
Short answer: you should pass tau to the applied function, e.g., rolling(d, win_type='exponential').sum(tau=10). Note that the mean function does not respect the exponential window as expected, so you may need to use sum(tau=10)/window_size to calculate the exponential mean. This is a bug in the current version of pandas (1.0.5).
Full example:
# To calculate the rolling exponential mean
import numpy as np
import pandas as pd
window_size = 10
tau = 5
a = pd.Series(np.random.rand(100))
rolling_mean_a = a.rolling(window_size, win_type='exponential').sum(tau=tau) / window_size
The answer by #Илья Митусов is not correct. With pandas 1.0.5, running the following code raises ValueError: exponential window requires tau:
import pandas as pd
import numpy as np
pd.Series(np.arange(10)).rolling(window=(4, 10), min_periods=1, win_type='exponential').mean(std=0.1)
This code has several problems. First, the 10 in window=(4, 10) is not tau, and it will lead to wrong answers. Second, the exponential window does not take a std parameter -- only the gaussian window does. Last, tau should be provided to mean (although mean does not respect the win_type).
How can I compute, in the Spark DataFrame below, the distance between Location A and Location B, and between Location A and Location C?
spark = SparkSession(sc)
df = spark.createDataFrame(
    [('A', 40.202750, 29.168350, 'B', 40.689247, -74.044502),
     ('A', 40.202750, 29.168350, 'C', 25.197197, 55.274376)],
    ['Location1', 'Lat1', 'Long1', 'Location2', 'Lat2', 'Lon2'])
So the dataset looks like this:
+---------+--------+--------+---------+---------+----------+
|Location1| Lat1| Long1|Location2| Lat2| Lon2|
+---------+--------+--------+---------+---------+----------+
| A|40.20275|29.16835| B|40.689247|-74.044502|
| A|40.20275|29.16835| C|25.197197| 55.274376|
+---------+--------+--------+---------+---------+----------+
Thank you
You could use the Haversine formula, which goes something like:
2*6378*asin(sqrt(pow(sin((lat2-lat1)/2),2) + cos(lat1)*cos(lat2)*pow(sin((lon2-lon1)/2),2)))
Furthermore, you can create a UDF for it:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from math import sin, cos, sqrt, asin

def haversine(lat1, lon1, lat2, lon2):
    return 2*6378*asin(sqrt(pow(sin((lat2-lat1)/2), 2) + cos(lat1)*cos(lat2)*pow(sin((lon2-lon1)/2), 2)))

dist_udf = F.udf(haversine, FloatType())
Make sure your latitude and longitude values are in radians.
You could add that conversion to the haversine function, before the calculation part (see the sketch below).
Once you have the UDF, you can straightaway do
df.withColumn('distance', dist_udf(F.col('Lat1'), F.col('Long1'), F.col('Lat2'), F.col('Lon2')))
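As a sketch of that degrees-to-radians conversion (assuming the column names from the example DataFrame above; this variant is not from the original answer):
from math import radians, sin, cos, sqrt, asin
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# Same Haversine UDF, but converting degree inputs to radians first
def haversine_deg(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    return 2 * 6378 * asin(sqrt(sin((lat2 - lat1) / 2) ** 2
                                + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2))

dist_udf = F.udf(haversine_deg, FloatType())
df = df.withColumn('distance', dist_udf('Lat1', 'Long1', 'Lat2', 'Lon2'))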
I'm trying to make a scatter plot of a GroupBy() with Multiindex (http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-with-multiindex). That is, I want to plot one of the labels on the x-axis, another label on the y-axis, and the mean() as the size of each point.
df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean() returns:
Sigma_ang Epsilon_K
3.4 30 0.647000
40 0.602071
50 0.619786
3.6 30 0.646538
40 0.591833
50 0.607769
3.8 30 0.616833
40 0.590714
50 0.578364
Name: RMSD, dtype: float64
And I'd like to plot something like: plt.scatter(x=Sigma, y=Epsilon, s=RMSD)
What's the best way to do this? I'm having trouble getting the proper Sigma and Epsilon values for each RMSD value.
+1 to Vaishali Garg. Based on his comment, the following works:
df_mean = df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean().reset_index()
plt.scatter(df_mean['Sigma'], df_mean['Epsilon'], s=100.*df_mean['RMSD'])
For example, I have got this Series:
17:50:51.050929 5601
17:52:15.429169 5601
17:52:19.538702 5601
17:53:44.776350 5601
17:53:51.870372 5598
17:55:33.952417 5600
17:56:48.736539 5596
17:57:01.205767 5593
17:57:26.066097 5593
17:57:30.644398 5591
I want to resample it, but I want the index to start at a rounded frequency.
So in the case above, I want the first index to be 17:51:00 if I resample at minute frequency.
However, pandas does it like this:
a.resample('1T', 'mean')
Out[125]:
17:50:51.050929 5601.000000
17:51:51.050929 5601.000000
17:52:51.050929 5601.000000
17:53:51.050929 5598.000000
17:54:51.050929 5600.000000
17:55:51.050929 5596.000000
17:56:51.050929 5592.333333
17:57:51.050929 NaN
How can I get a TimedeltaIndex starting from a rounded index, as with Timestamp resampling?
A quick way to do it is to normalise the index before resampling (using either floor, ceil, or round):
a.index = a.index.floor(freq='1T')
a = a.resample('1T').mean()
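Here is a minimal, self-contained sketch of that floor-then-resample approach, using a few of the sample timestamps from the question:
import pandas as pd

# Series indexed by a TimedeltaIndex, as in the question
idx = pd.to_timedelta(['17:50:51.050929', '17:52:15.429169', '17:53:44.776350',
                       '17:55:33.952417', '17:57:01.205767'])
a = pd.Series([5601, 5601, 5601, 5600, 5593], index=idx)

# Floor the index to the minute so the resampled bins start at 17:50:00
a.index = a.index.floor('1T')
print(a.resample('1T').mean())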
I have a date time column in a Pandas DataFrame and I'd like to convert it to minutes or seconds.
For example: I want to convert 00:27:00 to 27 mins.
example = data['duration'][0]
example
result: numpy.timedelta64(1620000000000,'ns')
What's the best way to achieve this?
Use array.astype() to convert the type of an array safely:
>>> import numpy as np
>>> a = np.timedelta64(1620000000000,'ns')
>>> a.astype('timedelta64[m]')
numpy.timedelta64(27,'m')
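For a whole pandas column, a sketch along the same lines (assuming data['duration'] is a timedelta64[ns] Series as in the question):
import pandas as pd

# Hypothetical DataFrame mirroring the question's data['duration'] column
data = pd.DataFrame({'duration': pd.to_timedelta(['00:27:00', '00:05:30'])})

# .dt.total_seconds() converts each timedelta to seconds; divide by 60 for minutes
data['minutes'] = data['duration'].dt.total_seconds() / 60  # 27.0, 5.5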