Pandas vectorization using cumulative sum of all previous rows and all following rows

I'm trying to convert an Excel sheet to Python, and part of one of the functions is (SUMSQ(K$8:K36)+SUMSQ(K38:K$39)).
This is the sum of squares of all previous rows plus the sum of squares of all following rows.
My best attempt is:
def SUMSQ(x):
    return sum(i**2 for i in x) - sum(x)**2 / len(x)
SUMSQ(df1['K']) + SUMSQ(df1.iloc[-1::]['K'])
I know I'm not indexing the part of the data frame correctly. Is there a way to index from the beginning of the column to the row above the current position?

Your question could use some clarification, but it seems like you want to exclude the current row from the SUMSQ calculation. If so, then the following should work.
import pandas as pd
df = pd.DataFrame({'k': range(1, 11)})
for idx in range(len(df)):
    # Apply SUMSQ to every row except the current one.
    print(SUMSQ(df.loc[df.index != idx, 'k']))
Output is
60.0
68.88888888888891
75.55555555555554
80.0
82.22222222222223
82.22222222222223
80.0
75.55555555555554
68.88888888888889
60.0
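The loop above calls SUMSQ once per row. As a minimal sketch, the same leave-one-out values can be computed without a loop by subtracting each row's contribution from the column totals. The helper name leave_one_out_sumsq is hypothetical, and it assumes the asker's SUMSQ definition (sum of squares minus the squared sum divided by the count), not Excel's plain sum of squares.

```python
import pandas as pd

def leave_one_out_sumsq(s: pd.Series) -> pd.Series:
    """For each row, apply the asker's SUMSQ formula to all other rows."""
    n = len(s)
    rest_sum = s.sum() - s                 # sum of the other n - 1 values
    rest_sq = (s ** 2).sum() - s ** 2      # sum of squares of the other values
    return rest_sq - rest_sum ** 2 / (n - 1)

df = pd.DataFrame({'k': range(1, 11)})
df['loo_sumsq'] = leave_one_out_sumsq(df['k'])
print(df['loo_sumsq'].tolist())
```

If the plain Excel-style sum of squares is wanted instead, rest_sq alone is the per-row result.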

Related

How to use previous row value in next row in pyspark dataframe

I have a pyspark dataframe and I want to perform calculation as
for i in range(0,(length-1)):
    x[i] = (x[i-1] - y[i-1]) * np.exp(-(t[i] -t[i-1])/v[i-1]) + y[i-1]
where x, y, t and v are lists built from float-type columns using
x = df.select('col_x').rdd.flatMap(lambda x:x).collect()
and similarly y, t and v for their respective columns.
This method works but not efficiently for data in bulk.
I want to perform this calculation in pyspark dataframe column. I want to update x column after every row and then use that updated value for calculating next row.
I have created columns holding the previous row's values using lag:
df = df.withColumn('prev_val_x', F.lag(df.x, 1).over(my_window))
and then tried calculating and updating x as:
df = df.withColumn('x', col('prev_val_x') - col('prev_val_y'))
but this does not use the updated value from the previous row.
Creating lists for the 4 columns using collect() takes a lot of time and gives a memory error, so I want to do the calculation within the dataframe column itself. Column x has the values 4.38, 0, 0, 0, … until the end: it only has a value in its first row and is filled with 0 in all other rows. Columns y, t and v contain float values.
How do I proceed with this?
Any help would be appreciated!
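The recurrence in this question feeds each freshly computed x into the next row, which a lag over the stored column cannot express, because lag reads the column as it was before the update. Below is a minimal NumPy sketch of the sequential computation only, not a Spark solution. The function name propagate_x is hypothetical, and it assumes the loop starts at the second row with the first row carrying the seed value (4.38 in the question).

```python
import numpy as np

def propagate_x(x0, y, t, v):
    """Run the row-by-row recurrence, seeding the first row with x0."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    v = np.asarray(v, dtype=float)
    x = np.empty(len(y), dtype=float)
    x[0] = x0
    for i in range(1, len(y)):
        # Each step uses the x value computed in the previous iteration.
        decay = np.exp(-(t[i] - t[i - 1]) / v[i - 1])
        x[i] = (x[i - 1] - y[i - 1]) * decay + y[i - 1]
    return x

# Made-up inputs: with y = 0 and v = 1 the result is a plain exponential decay.
x = propagate_x(4.38, y=[0.0, 0.0, 0.0], t=[0.0, 1.0, 2.0], v=[1.0, 1.0, 1.0])
print(x)
```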

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values for each year of that specific column and I lose the other columns. How can I do this while keeping the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values. Group on the sorted frame's own index, and do not select a single column, so that all columns are kept:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
Get the index of your query and use it to select rows from your original df:
idx = df.groupby(df.index.year)['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that effect, I can't test now without any test data)
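Since none of the answers above were tested against data, here is a small self-contained sketch of the sort-then-head approach. The six-row frame, the temperature column and the helper name top_n_per_year are made up for illustration; only totaldemand and the datetime index come from the question.

```python
import pandas as pd

# Hypothetical data: three timestamps in each of two years.
df = pd.DataFrame(
    {
        "totaldemand": [10, 30, 20, 5, 50, 40],
        "temperature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    },
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-06-01 12:00", "2020-12-31 23:00",
         "2021-01-01 00:00", "2021-06-01 12:00", "2021-12-31 23:00"]
    ),
)

def top_n_per_year(frame, column, n):
    """Keep whole rows holding the n largest values of `column` in each year."""
    ordered = frame.sort_values(column, ascending=False)
    # Group on the sorted frame's own index so keys line up with the rows.
    return ordered.groupby(ordered.index.year).head(n).sort_index()

top2 = top_n_per_year(df, "totaldemand", 2)
print(top2)
```

The final sort_index only restores chronological order and can be dropped if the descending order is preferred.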

How to filter a Dataframe based on an ID-Column which corresponds to a second Dataframe containing conditions for each ID efficiently?

I have a DataFrame with one ID column and two data columns X, Y containing numeric values. For each ID there are several rows of data.
I have a second DataFrame with the same ID column and two numeric columns specifying the lower and upper limit of the X values for each ID.
I want to use the second DataFrame to filter the first one so that it only keeps rows whose X values lie within the X_min to X_max range of the corresponding ID.
I can solve this by looping over the second DataFrame and filtering the groupby(ID) elements of the first one, but that is slow for a large number of IDs. Is there an efficient way to solve this?
Example code with the data in df, the ranges in df_ranges and the expected result in df_result. The real DataFrame is obviously a lot bigger.
import pandas as pd
x=[2.1,2.2,2.6,2.4,2.8,3.5,2.8,3.2]
y=[3.1,3.5,3.4,2.7,2.1,2.7,4.1,4.3]
ID=[0]*4+[0.1]*4
x_min=[2.0,3.0]
x_max=[2.5,3.4]
IDs=[0,0.1]
df=pd.DataFrame({'ID':ID,'X':x,'Y':y})
df_ranges=pd.DataFrame({'ID':IDs,'X_min':x_min,'X_max':x_max})
df_result=df.iloc[[0,1,3,7],:]
Possible Solution:
def filter_ranges(grp,df_ranges):
    x_min=df_ranges.loc[df_ranges.ID==grp.name,'X_min'].values[0]
    x_max=df_ranges.loc[df_ranges.ID==grp.name,'X_max'].values[0]
    return grp.loc[(grp.X>=x_min)&(grp.X<=x_max),:]
target_df_grp=df.groupby('ID').apply(filter_ranges,df_ranges=df_ranges)
Try this:
merged = df.merge(df_ranges, on='ID')
target_df = merged[(merged.X>=merged.X_min)&(merged.X<=merged.X_max)][['ID', 'X', 'Y']] # Here, desired filter is applied.
print(target_df) will give:
    ID    X    Y
0  0.0  2.1  3.1
1  0.0  2.2  3.5
3  0.0  2.4  2.7
7  0.1  3.2  4.3
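As a possible variation on the merge, the bounds can be looked up per row with Series.map and compared with Series.between, which keeps the original index of df and avoids building the wider merged frame. This is a sketch on the question's example data; it assumes each ID appears only once in df_ranges, and rows whose ID has no range are dropped because comparisons against NaN are False.

```python
import pandas as pd

x = [2.1, 2.2, 2.6, 2.4, 2.8, 3.5, 2.8, 3.2]
y = [3.1, 3.5, 3.4, 2.7, 2.1, 2.7, 4.1, 4.3]
ID = [0] * 4 + [0.1] * 4
df = pd.DataFrame({'ID': ID, 'X': x, 'Y': y})
df_ranges = pd.DataFrame({'ID': [0, 0.1], 'X_min': [2.0, 3.0], 'X_max': [2.5, 3.4]})

# Look up each row's bounds by ID, then keep rows with X inside [X_min, X_max].
ranges = df_ranges.set_index('ID')
lower = df['ID'].map(ranges['X_min'])
upper = df['ID'].map(ranges['X_max'])
target_df = df[df['X'].between(lower, upper)]
print(target_df)
```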

How do I preset the dimensions of my dataframe in pandas?

I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift it down by one row (eg: I started with 23 rows and it remains at 23 rows despite the fact that I shifted down by one and should have 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
##set index to very high number to accommodate shifting row down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis =1)
Thanks!
There might be a more efficient way to achieve what you are looking to do, but to directly answer your question you could do something like this to init the DataFrame with certain dimensions
pd.DataFrame(columns=range(300),index=range(500))
You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex. It mimics np.arange and range in syntax. You can also pass a name parameter to name it.
df = pd.DataFrame(
    index=pd.RangeIndex(500),
    columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)
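For the underlying problem of the last row being cut off, presetting the full 500 x 300 shape may not be needed. A minimal sketch, assuming a default integer index and using made-up activity values: extend the index by one row with reindex before shifting, so the last value has somewhere to go. The name "Activity shifted" is hypothetical.

```python
import pandas as pd

# Made-up stand-in for the "Activity (mCi)" column from the question.
activity = pd.Series([1.0, 2.0, 3.0], name="Activity (mCi)")

# Add one empty row at the end, then shift down by one.
extended = activity.reindex(range(len(activity) + 1))
shifted = extended.shift(1).rename("Activity shifted")

result = pd.concat([extended, shifted], axis=1)
print(result)
print(result.shape)
```

The original column gets NaN in the new last row, and the shifted copy gets NaN in the first row.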

Iterating through a pandas dataframe to create a chart that adds up to 100%

I have the following dataframe
I want to add two columns, "Stat total during quarter" (the total value of "stat" without the breakdown by param) and "% of quarter total", which would show how proportions have been changing over time, and then build a stacked chart that adds up to 100%.
Unfortunately I have trouble calculating "Stat total during quarter" in the "pandas way".
I ended up iterating through the dataframe cell by cell, which feels like a suboptimal solution, and then dividing one column by another to get the %:
# .ix was removed in pandas 1.0; .loc with column names is used here,
# assuming column 0 is 'period' and column 3 is 'stat total during quarter'.
for elements in df.index:
    df.loc[elements, 'stat total during quarter'] = df[df['period']==df.loc[elements, 'period']]['stat'].sum()
df['% of quarter total'] = df.stat / df['stat total during quarter']
Would really appreciate your thoughts.
Use groupby/transform to compute the sum for each period. transform always returns a Series of the same length as the original DataFrame. Thus, you can assign the return value to a new column of df:
df['stat total during quarter'] = df.groupby('period')['stat'].transform('sum')
df['% of quarter total'] = df['stat'] / df['stat total during quarter']
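A small runnable sketch of the groupby/transform approach, since the original dataframe is not shown. The period labels, the param column and the stat values are all made up; pivoting the shares so that each param becomes a column is one possible layout for a stacked chart.

```python
import pandas as pd

# Hypothetical data: two params observed in each of two quarters.
df = pd.DataFrame({
    "period": ["2015Q1", "2015Q1", "2015Q2", "2015Q2"],
    "param": ["a", "b", "a", "b"],
    "stat": [1.0, 3.0, 2.0, 2.0],
})

df["stat total during quarter"] = df.groupby("period")["stat"].transform("sum")
df["% of quarter total"] = df["stat"] / df["stat total during quarter"]

# One row per quarter, one column per param; each row sums to 1.
shares = df.pivot(index="period", columns="param", values="% of quarter total")
print(shares)
# With matplotlib installed, shares.plot(kind="bar", stacked=True) draws the chart.
```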