Iterating through a pandas dataframe to create a chart that adds up to 100% - pandas

I have the following dataframe
I want to add two columns, "Stat total during quarter" (the total value of "stat" without the breakdown by param) and "% of quarter total", to show how the proportions have changed over time, and then build a stacked chart that adds up to 100%.
Unfortunately I'm having trouble calculating "Stat total during quarter" the "pandas way".
I ended up iterating through the dataframe cell by cell, which feels like a suboptimal solution, and then dividing one column by another to get the %:
for elements in df.index:
    df.ix[elements, 3] = df[df['period'] == df.ix[elements, 0]]['stat'].sum()
df['% of quarter total'] = df.stat / df['stat total during quarter']
Would really appreciate your thoughts.

Use groupby/transform to compute the sum for each period. transform always returns a Series of the same length as the original DataFrame. Thus, you can assign the return value to a new column of df:
df['stat total during quarter'] = df.groupby('period')['stat'].transform('sum')
df['% of quarter total'] = df['stat'] / df['stat total during quarter']
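For example, with hypothetical sample data (the period, param, and stat values below are made up for illustration):

```python
import pandas as pd

# Made-up quarterly stats broken down by a "param" column
df = pd.DataFrame({
    'period': ['Q1', 'Q1', 'Q2', 'Q2'],
    'param': ['a', 'b', 'a', 'b'],
    'stat': [30, 70, 40, 60],
})

# transform('sum') broadcasts each group's total back to every row of the group
df['stat total during quarter'] = df.groupby('period')['stat'].transform('sum')
df['% of quarter total'] = df['stat'] / df['stat total during quarter']
```

Each quarter's rows now sum to 1.0 in the percentage column, which is exactly what a 100% stacked chart needs.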

Related

Pandas vectorization using all cumulative sum of all previous rows and all following rows

I'm trying to convert an Excel sheet to Python, and part of one of the functions is (SUMSQ(K$8:K36)+SUMSQ(K38:K$39)), i.e. the sum of squares of all previous rows plus the sum of squares of all following rows.
My best attempt is:
def SUMSQ(x):
    return sum([i**2 for i in x]) - sum(x)**2 / len(x)

SUMSQ(df1['K']) + SUMSQ(df1.iloc[-1::]['K'])
I know I'm not indexing the part of the data frame correctly. Is there a way to index from the beginning of the column to the row above the current position?
Your question could use some clarification, but it seems like you want to exclude the current row from the SUMSQ calculation. If so, then the following should work:
df = pd.DataFrame({'k': range(1, 11)})
for idx in range(len(df)):
    print(SUMSQ(df.loc[df.index != idx, 'k']))
Output is
60.0
68.88888888888891
75.55555555555554
80.0
82.22222222222223
82.22222222222223
80.0
75.55555555555554
68.88888888888889
60.0
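Looping over the frame once per row is O(n²). The leave-one-out sums can also be computed in a vectorized way; a sketch (reusing the asker's SUMSQ definition, which computes the sum of squared deviations rather than Excel's plain SUMSQ):

```python
import pandas as pd

df = pd.DataFrame({'k': range(1, 11)})
k = df['k']

# Leave-one-out totals: subtract each row's own contribution from the
# column-wide sums, then apply sum(x**2) - sum(x)**2 / n to every row at once
n = len(k) - 1
loo_sum = k.sum() - k                # sum of all the other elements
loo_sumsq = (k ** 2).sum() - k ** 2  # sum of squares of the other elements
df['sumsq_excl'] = loo_sumsq - loo_sum ** 2 / n
```

This reproduces the loop's output above (60.0 for the first and last rows, and so on) with a single pass over the column.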

counting each value in dataframe

So I want to create a plot or graph. I have time series data.
My dataframe looks like this:
df.head()
I need to count the values in df['status'] (there are 4 different values) and df['group_name'] (2 different values) for each day.
So I want a date index and a count of how many times each value from df['status'] appears, as well as df['group_name']. It should return a Series.
I used spam.groupby('date')['column'].value_counts().unstack().fillna(0).astype(int) and it's working as it should. Thank you all for the help.
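Applied to hypothetical data (the dates and statuses below are invented for illustration), the pattern looks like this:

```python
import pandas as pd

# Made-up sample data standing in for the time series
df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02'],
    'status': ['open', 'open', 'closed', 'open'],
})

# value_counts counts each status per day; unstack pivots the statuses into
# columns, and fillna(0) covers days where a status never appears
counts = (df.groupby('date')['status']
            .value_counts()
            .unstack()
            .fillna(0)
            .astype(int))
```

The same pattern works for df['group_name']; running it once per column gives one counts table per column.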

Apply function for counting length of a DataFrame filter

What's the best way to create a new pandas column with the length of filtering of another df based on a value from the first df?
df_account has account numbers
df_retention has rows for each date an account numbers was active
I am trying to create a new column on df_account that has the total number of days the account was active. Using .apply seems extremely slow.
def retention_count(x):
    return len(df_retention[df_retention['account'] == x])

df_account['retention_total'] = df_account['account'].apply(retention_count)
On a small number of rows, this works, but when my df_account has over 750k rows it is really slow. What can I do to make this faster? Thanks.
You could use groupby to count the rows per account in df_retention once, instead of filtering it again for every row. Assuming you set account as the index on df_account, pandas will align the counts by account (note size() rather than count(), since size() returns a single Series):
df_account.set_index('account', inplace=True)
df_account['retention_total'] = df_retention.groupby('account').size()
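A self-contained sketch with made-up account data; mapping the counts back avoids touching the index at all:

```python
import pandas as pd

# Hypothetical data: one row per account, and one row per active day
df_account = pd.DataFrame({'account': [1, 2, 3]})
df_retention = pd.DataFrame({'account': [1, 1, 2, 1, 2]})

# Count rows per account once, then align the counts back onto df_account;
# map is a single vectorized lookup instead of 750k filtered scans
counts = df_retention.groupby('account').size()
df_account['retention_total'] = (df_account['account']
                                 .map(counts)
                                 .fillna(0)
                                 .astype(int))
```

The fillna(0) handles accounts that never appear in df_retention, which .apply would have counted as 0 via an empty filter.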

How do I preset the dimensions of my dataframe in pandas?

I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift it down by one row (eg: I started with 23 rows and it remains at 23 rows despite the fact that I shifted down by one and should have 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
## set index to a very high number to accommodate shifting the row down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis=1)
Thanks!
There might be a more efficient way to achieve what you are looking to do, but to directly answer your question you could do something like this to init the DataFrame with certain dimensions
pd.DataFrame(columns=range(300),index=range(500))
You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex. It mimics np.arange and range in syntax. You can also pass a name parameter to name it.
df = pd.DataFrame(
index=pd.RangeIndex(500),
columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)
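As for the row being cut off: shift(1) keeps the frame the same length, so the last value falls off the end. One way to keep it (a sketch with a made-up activity column standing in for bolus_raw[["Activity (mCi)"]]) is to extend the index by one row before shifting:

```python
import pandas as pd

# Made-up stand-in for bolus_raw[["Activity (mCi)"]]
bolusCI = pd.DataFrame({'Activity (mCi)': [10.0, 20.0, 30.0]})

# reindex adds one empty row at the end, giving the shifted column room
# for the final value instead of dropping it
extended = bolusCI.reindex(range(len(bolusCI) + 1))
shifted = extended['Activity (mCi)'].shift(1).rename('shifted')

result = pd.concat([extended, shifted], axis=1)  # 4 rows, nothing lost
```

Starting from 3 rows, the result has 4 rows and the shifted column still ends with the original last value.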

Realign Pandas Rolling Average series in a dataframe

Very new to pandas. I'm importing stock data into a DataFrame and I want to calculate a 10-day rolling average. That I can figure out. The issue is that it gives 9 NaN values because of the 10-day moving-average window.
I want to realign the data so the 10th piece of data becomes the first value of a new rolling-average column at the top of the dataframe. I tried moving the data by writing the following code:
small = pd.rolling_mean(df['Close'], 10)
and then trying to add that to the df with the following code:
df['MA10D2'] = small[9:]
but it still leaves 9 NaN at the top. Can anyone help me out?
Assignment is aligned on the index: small[9:] still starts at index 9, so the assigned values keep their positions from index 9 onward.
The function you are searching for is called shift:
df['MA10D2'] = small.shift(-9)
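Note that pd.rolling_mean was later removed from pandas; the Series.rolling method is the current equivalent. A minimal runnable sketch with made-up prices:

```python
import pandas as pd

# Made-up closing prices standing in for the imported stock data
df = pd.DataFrame({'Close': range(1, 21)})

# Modern replacement for pd.rolling_mean(df['Close'], 10)
small = df['Close'].rolling(10).mean()

# shift(-9) moves the first valid average (at position 9) up to row 0
df['MA10D2'] = small.shift(-9)
```

The first row of MA10D2 now holds the mean of the first 10 closes, and the 9 NaN values move to the bottom of the column instead of the top.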