Fast Fourier Transform (fft) aggregation on Spark Dataframe groupby - numpy

I am trying to compute the FFT over a window using numpy's fft with a Spark dataframe, like this:
import numpy as np
df_grouped = df.groupBy(
    "id",
    "type",
    "mode",
    func.window("timestamp", "10 seconds", "5 seconds"),
).agg(
    percentile_approx("value", 0.25).alias("quantile_1(value)"),
    percentile_approx("magnitude", 0.25).alias("quantile_1(magnitude)"),
    percentile_approx("value", 0.5).alias("quantile_2(value)"),
    percentile_approx("magnitude", 0.5).alias("quantile_2(magnitude)"),
    percentile_approx("value", 0.75).alias("quantile_3(value)"),
    percentile_approx("magnitude", 0.75).alias("quantile_3(magnitude)"),
    avg("value"),
    avg("magnitude"),
    min("value"),
    min("magnitude"),
    max("value"),
    max("magnitude"),
    kurtosis("value"),
    kurtosis("magnitude"),
    var_samp("value"),
    var_samp("magnitude"),
    stddev_samp("value"),
    stddev_samp("magnitude"),
    np.fft.fft("value"),
    np.fft.fft("magnitude"),
    np.fft.rfft("value"),
    np.fft.rfft("magnitude"),
)
Every aggregation function works fine; however, for the fft I get:
tuple index out of range
and I don't understand why. Do I need to do anything in particular to the values for numpy's fft to work? The values are all floats. When I print the column it looks like this:
[Row(value_0=6.247499942779541), Row(value_0=63.0), Row(value_0=54.54375076293945), Row(value_0=0.7088077664375305), Row(value_0=51.431251525878906), Row(value_0=0.09377499669790268), Row(value_0=0.09707500040531158), Row(value_0=6.308750152587891), Row(value_0=8.503950119018555), Row(value_0=295.8463134765625), Row(value_0=7.938048839569092), Row(value_0=8.503950119018555), Row(value_0=0.7090428471565247), Row(value_0=0.7169944643974304), Row(value_0=0.5659012794494629)]
I am guessing the Spark Row might be the issue, but I am unsure how to convert it in this context.

np.fft.fft is a NumPy function, not a PySpark function, so you cannot apply it directly to a dataframe.
Moreover, it takes an array as input. "value" is just a string; fft cannot infer that it should operate on the aggregated list of values from the column "value". You have to build that list yourself.
import numpy as np
from pyspark.sql import functions as F, types as T

df_grouped = df.groupBy(
    "id",
    "type",
    "mode",
    F.window("timestamp", "10 seconds", "5 seconds"),
).agg(
    F.percentile_approx("value", 0.25).alias("quantile_1(value)"),
    ...,
    F.stddev_samp("magnitude"),
    # Replace the np.fft.fft calls with collect_list:
    F.collect_list("value").alias("values"),
    F.collect_list("magnitude").alias("magnitudes"),
)

# Definition of the UDF fft. Do the same for rfft.
@F.udf(T.ArrayType(T.FloatType()))
def fft_udf(array):
    # np.fft.fft returns complex values; their magnitudes are kept here so
    # the result fits in an array<float> column.
    return [float(abs(x)) for x in np.fft.fft(array)]

# Do that for all your columns.
df_grouped = df_grouped.withColumn("fft_values", fft_udf(F.col("values")))
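The pattern is the same for the remaining columns; here is a minimal sketch (the extra UDF and output column names are my own, and rfft also returns complex values, so their magnitudes are kept as well):
# Sketch: a second UDF for the real FFT, applied to the other collected lists.
@F.udf(T.ArrayType(T.FloatType()))
def rfft_udf(array):
    return [float(abs(x)) for x in np.fft.rfft(array)]

df_grouped = (
    df_grouped
    .withColumn("fft_magnitudes", fft_udf(F.col("magnitudes")))
    .withColumn("rfft_values", rfft_udf(F.col("values")))
    .withColumn("rfft_magnitudes", rfft_udf(F.col("magnitudes")))
)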

Related

group by in pandas API on spark

I have a pandas dataframe below,
import pandas as pd

data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)
Here df is a Pandas dataframe.
I am trying to convert this dataframe to pandas API on spark
import pyspark.pandas as ps
pdf = ps.from_pandas(df)
print(type(pdf))
Now the dataframe type is <class 'pyspark.pandas.frame.DataFrame'>.
Now I am applying the groupby function on pdf like below:
for i, j in pdf.groupby("Team"):
    print(i)
    print(j)
I am getting an error like:
KeyError: (0,)
Is this functionality not supported by the pandas API on Spark?
The pandas API on Spark does not implement every pandas feature as-is, because of Spark's distributed architecture; operations such as row-wise iteration therefore behave differently or are not available.
If you want to print the groups, then pyspark pandas code:
pdf.groupby("Team").apply(lambda g: print(f"{g.Team.values[0]}\n{g}"))
is equivalent to pandas code:
for name, sub_grp in df.groupby("Team"):
    print(name)
    print(sub_grp)
Reference to source code
If you scan the source code, you will find that there is no __iter__() implementation for pyspark pandas: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/pandas/groupby.html
but the iterator yields (group_name, sub_group) for pandas: https://github.com/pandas-dev/pandas/blob/v1.5.1/pandas/core/groupby/groupby.py#L816
Documentation reference to iterate groups
pyspark pandas : https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html?highlight=groupby#indexing-iteration
pandas : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups
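If the grouped data is small enough to collect on the driver, another workaround is to convert back to plain pandas and use its iterator; a minimal sketch, assuming pdf fits in driver memory:
# Sketch: materialize the pandas-on-Spark frame as a plain pandas DataFrame,
# then iterate with the ordinary pandas groupby generator.
for name, sub_grp in pdf.to_pandas().groupby("Team"):
    print(name)
    print(sub_grp)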
If you just want to see the groups, print the items that the groupby generator yields. Or just use pandas:
for i in df.groupby("Team"):
    print(i)
Or:
for i in pdf.groupBy("Team"):
    print(i)

filter a pivot table pandas

I am trying to filter a pivot table, following adding filter to pandas pivot table, but it doesn't work.
maintenance_part_consumptionDF[(maintenance_part_consumptionDF.Asset == 'A006104') & (maintenance_part_consumptionDF.Reason == 'R565')].pivot_table(
    values=["Quantity", "Part"],
    index=["Asset"],
    columns=["Reason"],
    aggfunc={"Quantity": np.sum, "Part": lambda x: len(x.unique())},
    fill_value=0,
)
But it shows: TypeError: pivot_table() got multiple values for argument 'values'
Update
Creation of the pivot table:
import numpy as np
import pandas as pd

maintenance_part_consumption_pivottable_part = pd.pivot_table(
    maintenance_part_consumptionDF,
    values=["Quantity", "Part"],
    index=["Asset"],
    columns=["Reason"],
    aggfunc={"Quantity": np.sum, "Part": lambda x: len(x.unique())},
    fill_value=0,
)
maintenance_part_consumption_pivottable_part.head(2)
When I slice it manually:
maintenance_part_consumption_pivottable_partDF=pd.DataFrame(maintenance_part_consumption_pivottable_part)
maintenance_part_consumption_pivottable_partDF.iloc[15,[8]]
I get this output:
        Reason
Part    R565      38
Name: A006104, dtype: int64
Which is the exact output I need.
But I don't want to do it this way with iloc, because it is too mechanical: I have to count the row and column positions before getting to the result 38.
Hint: ideally I could select by the asset label itself and by the reason label, as asked in the question below.
How many unique parts were used for the asset A006104 for the failure
reason R565?
Sorry, I wanted to upload the table via an image to make it more realistic but I am not allowed.
If you read the documentation for DataFrame.pivot_table, the first parameter is values, so in your code:
.pivot_table(
    maintenance_part_consumptionDF,  # this is `values`
    values=["Quantity", "Part"],     # this is also `values`
    ...
)
Simply drop the first argument:
.pivot_table(
    values=["Quantity", "Part"],
    index=["Asset"],
    columns=["Reason"],
    aggfunc={"Quantity": np.sum, "Part": lambda x: len(x.unique())},
    fill_value=0,
)
There is also a closely related module-level function, pd.pivot_table, whose first parameter is a dataframe. Don't mix up the two.
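As a side note on the asker's iloc concern: once the pivot table exists, a single cell can be selected by labels instead of positions. A sketch, assuming the pivot table built in the Update above (its columns form a MultiIndex of (value name, Reason)):
# Sketch: label-based lookup instead of .iloc[15, [8]].
unique_parts = maintenance_part_consumption_pivottable_part.loc[
    "A006104", ("Part", "R565")
]
print(unique_parts)  # unique parts used for asset A006104 and reason R565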

pandas groupby returns multiindex with two more aggregates

When grouping by a single column with as_index=False, pandas behaves as expected. However, when I use .agg with multiple aggregates, as_index no longer appears to behave as expected; in short, it doesn't appear to matter.
# imports
import pandas as pd
import numpy as np
# set the seed
np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a','b'], size=10)
summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
                   a
       count_nonzero      mean
letter
a                6.0  0.539313
b                4.0  0.456702
whereas I would have expected the index to be 0, 1 with letter as a column in the dataframe.
In summary, I want to group by one or more columns, summarize a single column with multiple aggregates, and get back a dataframe that has neither the groupby columns as the index nor a MultiIndex in the columns.
The comment from @Trenton did the trick.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
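For reference, named aggregation (pandas 0.25+) achieves the same flat result without reset_index; a sketch, with output column names of my own choosing:
# Sketch: named aggregation keeps 'letter' as a column and yields flat columns.
summary = df.groupby('letter', as_index=False).agg(
    count_nonzero=('a', np.count_nonzero),
    mean_a=('a', 'mean'),
)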

Flatten and rename multi-index agg columns

I have some Pandas / cudf code that aggregates a particular column using two aggregate methods, and then renames the multi-index columns to flattened columns.
df = (
    some_df
    .groupby(["some_dimension"])
    .agg({"some_metric": ["sum", "max"]})
    .reset_index()
    .rename(columns={"some_dimension": "some_dimension__id",
                     ("some_metric", "sum"): "some_metric_sum",
                     ("some_metric", "max"): "some_metric_max"})
)
This works great in cudf, but does not work in Pandas 0.25 -- the hierarchy is not flattened out.
Is there a similar approach using Pandas? I like the cudf tuple syntax and how they just implicitly flatten the columns. Hoping to find a similarly easy way to do it in Pandas.
Thanks.
In pandas 0.25.0+ there is something called groupby aggregation with relabeling.
Here is a stab at your code
df = (some_df
      .groupby(["some_dimension"])
      .agg(some_metric_sum=("some_metric", "sum"),
           some_metric_max=("some_metric", "max"))
      .reset_index()
      .rename(columns={"some_dimension": "some_dimension__id"}))
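If you prefer to keep the original dict-style agg, another common pattern is to flatten the resulting column MultiIndex yourself; a sketch:
# Sketch: join the (column, agg) tuples into single strings, e.g. "some_metric_sum".
df = some_df.groupby(["some_dimension"]).agg({"some_metric": ["sum", "max"]})
df.columns = ["_".join(col) for col in df.columns]
df = df.reset_index().rename(columns={"some_dimension": "some_dimension__id"})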

Linear 1D interpolation on multiple datasets using loops

I'm interested in performing Linear interpolation using the scipy.interpolate library. The dataset looks somewhat like this:
[Image: dataframe for interpolation between X and Y for different RUNs]
I'd like to use this interpolated function to find the missing Y from this dataset:
[Image: dataframe on which to use the interpolation function]
The number of runs shown here is just 3, but the real dataset runs into thousands of runs, so I'd appreciate advice on how to build these interpolation functions in a loop.
from scipy.interpolate import interp1d

for RUNNumber in range(TotalRuns):
    InterpolatedFunction[RUNNumber] = interp1d(X, Y)
As I understand it, you want a separate interpolation function defined for each run. Then you want to apply these functions to a second dataframe. I defined a dataframe df with columns ['X', 'Y', 'RUN'], and a second dataframe, new_df with columns ['X', 'Y_interpolation', 'RUN'].
interpolating_functions = dict()
for run_number in range(1, max_runs):
    run_data = df[df['RUN'] == run_number][['X', 'Y']]
    interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])
Now that we have interpolating functions for each run, we can use them to fill in the 'Y_interpolation' column in a new dataframe. This can be done using the apply function, which takes a function and applies it to each row in a dataframe. So let's define an interpolate function that will take a row of this new df and use the X value and the run number to calculate an interpolated Y value.
def interpolate(row):
    int_func = interpolating_functions[row['RUN']]
    interp_y = int_func._call_linear([row['X']])  # the _call_linear method
                                                  # expects and returns an array
    return interp_y[0]
Now we just use apply and our defined interpolate function.
new_df['Y_interpolation'] = new_df.apply(interpolate, axis=1)
I'm using pandas version 0.20.3, and this gives me a new_df with the Y_interpolation column filled in for every run.
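For reference, a self-contained sketch of the whole flow with made-up sample data, using the public call syntax of interp1d (calling the object directly) rather than the private _call_linear method:
import pandas as pd
from scipy.interpolate import interp1d

# Made-up example data: three (X, Y) points per run, plus rows to interpolate.
df = pd.DataFrame({'RUN': [1, 1, 1, 2, 2, 2],
                   'X':   [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
                   'Y':   [10.0, 20.0, 30.0, 5.0, 15.0, 25.0]})
new_df = pd.DataFrame({'RUN': [1, 2], 'X': [1.5, 2.5]})

# One interpolating function per run, then apply it row by row.
interpolating_functions = {run: interp1d(g['X'], g['Y'])
                           for run, g in df.groupby('RUN')}
new_df['Y_interpolation'] = new_df.apply(
    lambda row: float(interpolating_functions[row['RUN']](row['X'])), axis=1)
print(new_df)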