How to use previous row value in next row in pyspark dataframe

I have a PySpark dataframe and I want to perform a calculation like the following:
for i in range(1, length):
    x[i] = (x[i-1] - y[i-1]) * np.exp(-(t[i] - t[i-1]) / v[i-1]) + y[i-1]
where x, y, t and v are lists built from float-type columns, created using
x = df.select('col_x').rdd.flatMap(lambda x: x).collect()
and similarly for the y, t and v columns.
This method works, but not efficiently for bulk data.
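For reference, a minimal sketch of this collect-based approach (the column names col_x, col_y, col_t and col_v are placeholders for the real ones):

import numpy as np

x = df.select('col_x').rdd.flatMap(lambda r: r).collect()
y = df.select('col_y').rdd.flatMap(lambda r: r).collect()
t = df.select('col_t').rdd.flatMap(lambda r: r).collect()
v = df.select('col_v').rdd.flatMap(lambda r: r).collect()

# recurrence from above, starting at the second row
for i in range(1, len(x)):
    x[i] = (x[i-1] - y[i-1]) * np.exp(-(t[i] - t[i-1]) / v[i-1]) + y[i-1]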
Instead, I want to perform this calculation on the PySpark dataframe column itself: update the x column row by row and use the updated value when calculating the next row.
I have created a column holding the previous row's value using lag:
df = df.withColumn('prev_val_x', F.lag(df.x, 1).over(my_window))
and then tried to calculate and update x as
df = df.withColumn('x', (col('prev_val_x') - col('prev_val_y')))
but it does not update x using the value computed for the previous row.
Creating lists for the 4 columns using collect() takes a lot of time and eventually gives a memory error, so I want to do the calculation within the dataframe columns themselves. Column x has values like 4.38, 0, 0, 0, … until the end: only its first row has a real value and every following row is filled with 0. Columns y, t and v hold float values.
How do I proceed with this?
Any help would be appreciated!

Related

How to broadcast a list of data into dataframe (Or multiIndex )

I have a big dataframe, about 200k rows and 3 columns (x, y, z). Some rows don't have y and z values and only have an x value. I want to make a new column so that the first block of data with z values gets 1, the second gets 2, then 3, etc., or alternatively build a MultiIndex in the same format.
The following image shows what I mean.
I made a new column called "NO." and set zero as its initial value. Then I recorded the indices where I want the new column to get a new value, with the following code:
df = pd.read_fwf(path, header=None, names=['x', 'y', 'z'])
df['NO.'] = 0
index_NO_changed = df.index[df['z'].isnull()]
Then I loop through it and change the number:
for i in range(len(index_NO_changed) - 1):
    df['NO.'].iloc[index_NO_changed[i]:index_NO_changed[i + 1]] = i + 1
df['NO.'].iloc[index_NO_changed[-1]:] = len(index_NO_changed)
But the problem is that I get the warning "A value is trying to be set on a copy of a slice from a DataFrame".
I was wondering: is there any better way? Is creating a MultiIndex instead of adding another column easier, considering the size of the dataframe?
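For what it's worth, a minimal sketch of the same loop rewritten with a single positional-indexing call (df.iloc with an explicit column position), which avoids the chained-assignment warning; it assumes the default RangeIndex, as in the code above:

# look up the column position once, then set values through df.iloc directly
no_col = df.columns.get_loc('NO.')
for i in range(len(index_NO_changed) - 1):
    df.iloc[index_NO_changed[i]:index_NO_changed[i + 1], no_col] = i + 1
df.iloc[index_NO_changed[-1]:, no_col] = len(index_NO_changed)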

Conditional row number Pandas

I need to add a row number to my dataframe based on a certain condition; the image below shows the input data frame. I need a row number column in my dataframe as illustrated in the image below (Rank column): whenever the "RequestResubmitted" value is found within a group, I want the rank to reset to 1 again.
Let us try cumsum to create the sub-group key, then groupby + cumcount:
s = df.groupby([df['Word Order Code'], df['Status Code'].eq('Request Submitted').cumsum()]).cumcount() + 1
df['rank'] = s
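To illustrate how the cumsum key makes the count restart, here is a small self-contained example on made-up data (the status string must of course match whatever value marks a reset in your data):

import pandas as pd

df = pd.DataFrame({
    'Word Order Code': ['A', 'A', 'A', 'A', 'A'],
    'Status Code': ['Request Submitted', 'Approved',
                    'Request Submitted', 'Rejected', 'Closed'],
})
# the key increases by 1 at every reset row: 1, 1, 2, 2, 2
key = df['Status Code'].eq('Request Submitted').cumsum()
df['rank'] = df.groupby([df['Word Order Code'], key]).cumcount() + 1
print(df['rank'].tolist())  # [1, 2, 1, 2, 3]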

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. For every year in my dataframe, I want to extract the entire rows (containing all columns) corresponding to the top 10 values of a specific column. So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem is that I only get the top 10 values of that specific column for each year and lose the other columns. How can I do this operation while keeping the values of the other columns for the rows that hold the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values (without selecting a single column, so that every column is kept):
df = df.sort_values('totaldemand', ascending=False).groupby([df.index.year]).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
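For illustration, a small self-contained sketch on made-up hourly data (the column names totaldemand and othercol are just placeholders) shows that every column is kept and the result gets a (year, timestamp) MultiIndex:

import numpy as np
import pandas as pd

idx = pd.date_range('2018-01-01', periods=24, freq='H').append(
    pd.date_range('2019-01-01', periods=24, freq='H'))
df = pd.DataFrame({'totaldemand': np.random.rand(48),
                   'othercol': np.arange(48)}, index=idx)

top = df.groupby(df.index.year).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
# top keeps both columns and is indexed by (year, original timestamp)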
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(1)
df.loc[idx]
(or something to that extent, I can't test now without any test data)

Convert Series to Dataframe where series index is Dataframe column names

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
As a result I am getting a Series object where row.index.values contains the names of the df columns. But I wanted a dataframe with only one row, keeping the dataframe columns in place.
When I do row.to_frame(), instead of a 1x85 dataframe (1 row, 85 cols) I get an 85x1 dataframe where the index contains the names of the columns, and its columns attribute outputs Int64Index([0], dtype='int64'). But all I want is the original dataframe columns with only one row. How do I do it?
Or, how do I convert the row.index values to row.columns values and change the 85x1 shape to 1x85?
You just need to add .T:
row.to_frame().T
Alternatively, change your for loop by adding []:
for i in range(num_rows):
    row = df.iloc[[i]]
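A quick sanity check on a hypothetical 2x2 frame shows that both approaches give a one-row DataFrame with the original columns:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.iloc[0].to_frame().T.shape)  # (1, 2): Series transposed back into one row
print(df.iloc[[0]].shape)             # (1, 2): already a one-row DataFrame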

Python and Pandas: apply per multiple columns

I am new to Python and I was able to use apply on a dataframe to create a new column inside it:
X['Geohash'] = X[['Lat', 'Long']].apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)
This calls the geohash function with the latitude and longitude of each row.
Now I have two new dataframes, one for latitude and one for longitude. Each dataframe has twenty columns, and I want
.apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)
to be called twenty times:
-first with the first latitude column and the first longitude column,
-then with the second latitude column and the second longitude column, and so on.
How can I iterate per column and at each iteration call
.apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)?
What I want is a new dataframe with twenty columns, each column being the result of the geohash function.
Ideas will be appreciated.
You can do this by creating an "empty" dataframe with 20 columns, and then using df.columns[i] to loop through your other dataframes - something like this:
output = pd.DataFrame({i: [] for i in range(20)})
This creates an empty dataframe with all the columns you wanted (numbered).
Now, let's say the longitude and latitude dataframes are called 'lon' and 'lat'. We need to join them into one dataframe. Then:
lonlat = lat.join(lon)
for i in range(len(output.columns)):
    output[output.columns[i]] = lonlat.apply(lambda column: geohash.encode(column[lat.columns[i]],
                                                                           column[lon.columns[i]],
                                                                           precision=8), axis=1)
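As a rough check, here is the same pattern on two hypothetical 2-column frames; any encode(lat, lon, precision) function would do in place of geohash.encode, and the latitude and longitude frames need distinct column names (and a shared index) for join to work:

import pandas as pd
import geohash  # python-geohash, as in the question

# two small stand-ins for the 20-column latitude/longitude frames
lat = pd.DataFrame({'lat_a': [48.85, 40.71], 'lat_b': [51.50, 35.68]})
lon = pd.DataFrame({'lon_a': [2.35, -74.01], 'lon_b': [-0.13, 139.69]})

lonlat = lat.join(lon)
output = pd.DataFrame({i: [] for i in range(len(lat.columns))})
for i in range(len(output.columns)):
    output[output.columns[i]] = lonlat.apply(
        lambda row: geohash.encode(row[lat.columns[i]], row[lon.columns[i]], precision=8),
        axis=1)
# output now holds one geohash column per latitude/longitude pair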