Conditional row number in pandas

I need to add a row number to my dataframe based on a certain condition; the input data frame is shown in the image below.
I need a row number column in my dataframe, as illustrated in the image below (the Rank column).
Whenever a "RequestResubmitted" value is found within a group, I want the rank to reset to 1.

Let us try cumsum to create the sub-group key, then groupby + cumcount:
# each 'Request Submitted' row starts a new sub-group within a work order
s = df.groupby([df['Word Order Code'],
                df['Status Code'].eq('Request Submitted').cumsum()]).cumcount() + 1
df['rank'] = s
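For illustration, here is a minimal self-contained version of the trick (the data is hypothetical, since the original image is not available):
import pandas as pd

df = pd.DataFrame({
    'Word Order Code': ['A', 'A', 'A', 'A', 'A', 'A'],
    'Status Code': ['Request Submitted', 'In Progress', 'Complete',
                    'Request Submitted', 'In Progress', 'Complete'],
})
# eq() flags the rows where the counter should reset;
# cumsum() turns each flag into a new sub-group id
key = df['Status Code'].eq('Request Submitted').cumsum()
df['rank'] = df.groupby([df['Word Order Code'], key]).cumcount() + 1
print(df['rank'].tolist())  # [1, 2, 3, 1, 2, 3]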

Related

Dataframe add column with calculation

I have a DataFrame with multiple columns and I am trying to add a new column that calculates the sum of two of these columns. Is there any function that can loop through the whole dataframe?
(The original and desired dataframes were shown as images.)
Why do all columns have the same name?
In pandas, we can create a new column holding the sum of the values of two other columns with the following code:
df['sum'] = df['col_1'] + df['col_2']
But first you need to rename the columns so that they are no longer identical.
Then you can arrange the columns as desired.
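A minimal sketch of the whole flow, with hypothetical data and column names:
import pandas as pd

# two identically named columns, as in the screenshots
df = pd.DataFrame([[1, 2], [3, 4]], columns=['value', 'value'])
df.columns = ['col_1', 'col_2']        # first, give the columns distinct names
df['sum'] = df['col_1'] + df['col_2']  # element-wise sum of the two columns
df = df[['col_1', 'sum', 'col_2']]     # then reorder the columns as desired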

How to use previous row value in next row in pyspark dataframe

I have a pyspark dataframe and I want to perform a calculation as
for i in range(1, length):
    x[i] = (x[i-1] - y[i-1]) * np.exp(-(t[i] - t[i-1]) / v[i-1]) + y[i-1]
where x, y, t and v are lists built from float-type columns using
x = df.select('col_x').rdd.flatMap(lambda x: x).collect()
and similarly for y, t and v from their respective columns.
This method works, but not efficiently for bulk data.
I want to perform this calculation on the pyspark dataframe column itself: update column x after every row and then use that updated value when calculating the next row.
I have created a column with the previous row's value using lag:
df = df.withColumn('prev_val_x', F.lag(df.x, 1).over(my_window))
and then tried calculating and updating x as
df = df.withColumn('x', col('prev_val_x') - col('prev_val_y'))
but it does not update the value with the previous row's computed value.
Creating lists for four columns using collect() takes a lot of time and eventually gives a memory error, so I want to do the calculation within the dataframe itself. Column x has the values 4.38, 0, 0, 0, … till the end: only its first row has a value and every other row is 0. Columns y, t and v hold float values.
How do I proceed with this?
Any help would be appreciated!
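Since each x[i] depends on the x[i-1] computed in the previous step, the recurrence is inherently sequential and cannot be expressed with lag alone (lag sees the original column values, not the updated ones). One workaround, sketched below under the assumptions that the data fits in driver memory and that column 't' defines the row order, is to run the loop in pandas on the driver:
import numpy as np

# hypothetical sketch: pull the ordered data into pandas, run the
# sequential recurrence with numpy, then return to a Spark dataframe
pdf = df.orderBy('t').toPandas()
x = pdf['x'].to_numpy(dtype=float).copy()
y = pdf['y'].to_numpy(dtype=float)
t = pdf['t'].to_numpy(dtype=float)
v = pdf['v'].to_numpy(dtype=float)
for i in range(1, len(pdf)):
    # each step uses the x value computed in the previous iteration
    x[i] = (x[i-1] - y[i-1]) * np.exp(-(t[i] - t[i-1]) / v[i-1]) + y[i-1]
pdf['x'] = x
result = spark.createDataFrame(pdf)  # back to Spark, if needed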

Group by multiple columns creating new column in pandas dataframe

I have a pandas dataframe with two columns: 'company', which is a string, and 'publication_datetime', which is a datetime.
I want to group by company and publication date, adding a new column with the maximum publication_datetime for each record.
So far I have tried:
issuers = news[['company', 'publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company', 'publication_date'], as_index=False)['publication_datetime'].max()
My groupby does not appear to work; I get the following error:
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to cast the result back into the original dimensions of the dataframe.
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
Your groupby() was returning a multi-indexed group object (first group key, second group key, then the values), which is why it complains about 3 items being passed where 1 is expected. And even if you returned just the values, like groups are combined, so you would have fewer values than rows.
The transform() method returns the group result for each row of the dataframe, which makes it easy to create a new column: the returned values are a Series carrying the original indices of the issuers dataframe.
Hope this helps! See the pandas documentation for transform.
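A toy example (hypothetical data) showing the difference in shape between max() and transform('max'):
import pandas as pd

issuers = pd.DataFrame({
    'company': ['A', 'A', 'B'],
    'publication_datetime': pd.to_datetime(
        ['2020-01-01 09:00', '2020-01-01 17:00', '2020-01-02 12:00']),
})
issuers['publication_date'] = issuers['publication_datetime'].dt.date
g = issuers.groupby(['company', 'publication_date'])['publication_datetime']
print(g.max().shape)             # (2,) -- one value per group
print(g.transform('max').shape)  # (3,) -- one value per original row
issuers['max'] = g.transform('max')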
The problem is that you are trying to assign a whole DataFrame to a single column.
The following extracts only the values, without the two index columns:
issuers['publication_datetime_max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].max().tolist()
Hope this helps!

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly resolution with several columns. I want to extract the entire rows (containing all columns) for the top 10 values of a specific column for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)  # keeps every column of the top-10 rows per year
nlargest can be applied to each group, passing the column to look for largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
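For instance, with hypothetical hourly data spanning two years:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range('2019-01-01', '2020-12-31 23:00', freq='H')
df = pd.DataFrame({'totaldemand': rng.random(len(idx)),
                   'othercol': rng.random(len(idx))}, index=idx)
top10 = df.groupby(df.index.year).apply(lambda grp: grp.nlargest(10, 'totaldemand'))
print(top10.shape)  # (20, 2) -- 10 full rows per year, all columns kept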
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test now without any test data)

How do I preset the dimensions of my dataframe in pandas?

I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift down by one row (e.g. I started with 23 rows and it remains at 23 rows, even though shifting down by one should give me 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
# set index to a very high number to accommodate shifting the rows down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis =1)
Thanks!
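As a side note on the shift itself: instead of presetting the full frame, a minimal sketch (assuming a default RangeIndex; the data below is hypothetical) is to extend the index by one row before shifting, so the last value is not cut off:
import pandas as pd

# hypothetical stand-in for bolus_raw[["Activity (mCi)"]]
bolusCI = pd.DataFrame({'Activity (mCi)': [4.2, 3.9, 3.1]})
extended = bolusCI.reindex(range(len(bolusCI) + 1))  # one extra NaN row at the end
activity_copy = extended.shift(1)                    # the old last value survives the shift
result = pd.concat([extended, activity_copy], axis=1)
print(result.shape)  # (4, 2)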
There might be a more efficient way to achieve what you are looking to do, but to directly answer your question, you could do something like this to initialize the DataFrame with the desired dimensions:
pd.DataFrame(columns=range(300), index=range(500))
You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex. It mimics np.arange and range in syntax. You can also pass a name parameter to name it.
See the documentation for pd.DataFrame and pd.Index.
df = pd.DataFrame(
    index=pd.RangeIndex(500),
    columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)