AttributeError for df.apply() when trying to subtract the column mean and divide by the column standard deviation for each column in a dataframe - pandas

I have a data frame with roughly 26 columns. For 14 of these columns (all data are floats) I want to determine the mean and standard deviation for each column, then for each value in each column I want to subtract the column mean and divide by the column standard deviation (only for the column to which the value belongs).
I can do this separately for each column like so:
import numpy as np

chla_array = df['Chla'].to_numpy()
mean_chla = np.nanmean(chla_array)
std_chla = np.nanstd(chla_array)
df['Chla_standardized'] = (df['Chla'] - mean_chla) / std_chla
Because I have 14 columns to do this for, I am looking for a more concise way of coding this, rather than copying and pasting the above code thirteen more times and changing the column headers. I was thinking of using df.apply() but I can't get it to work. Here is what I have:
df = df.iloc[:, [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]]
df_standardized = df.apply((df - df.mean(skipna=True)) / df.std(skipna=True, ddof=0))
The error I encounter is this:
AttributeError: 'Canyon_dist' is not a valid function for 'Series' object
Where 'Canyon_dist' is the header for the first column the code encounters.
I'm not sure that df.apply is appropriate for what I am trying to achieve, so if there is a more appropriate way of doing this please let me know (perhaps using a for loop?).
I am open to all suggestions and thank you.
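For what it's worth, df.apply expects a function, while the code above passes it an already-computed DataFrame, which is what triggers the AttributeError. Since pandas arithmetic broadcasts column-wise anyway, no apply is needed at all; a minimal sketch, assuming df still holds the 14 columns selected with iloc above:

# Subtracting a Series from a DataFrame aligns on column labels,
# so each column gets its own mean and standard deviation.
# mean() and std() skip NaN by default; ddof=0 matches np.nanstd.
df_standardized = (df - df.mean()) / df.std(ddof=0)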

Related

Select column for formatting based on cell content

In xlsxWriter I can format a column by selecting its column index as it is in Excel via letters, e.g. worksheet.set_column('V:V', 20, cell_format). Is it also possible to define which column to format based on its value (e.g. header columns)?
For example worksheet.set_column('cell_value', 20, cell_format)?
That isn't possible using xlsxwriter, which is in general a write-only library. Instead you could track the column headers and indices in a dict and use that. Something like:
cols = {'Year': 0, 'Income': 1, 'Outgoings': 2}
# set_column(first_col, last_col, width, cell_format)
worksheet.set_column(cols['Year'], cols['Year'], 20, cell_format)
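If the header row is written from a list anyway, the lookup dict can be built automatically; a small sketch, where headers is a hypothetical list of column names:

headers = ['Year', 'Income', 'Outgoings']  # hypothetical header row
cols = {name: idx for idx, name in enumerate(headers)}
worksheet.write_row(0, 0, headers)
worksheet.set_column(cols['Income'], cols['Income'], 20, cell_format)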

Pandas columns by given value in last row

Below is my dataframe "df", made of 34 columns (pairs of stocks) and 530 rows (their respective cumulative returns). 'Date' is the index.
Now, my target is to consider the last row (Date = 3 February 2021). I want to plot ONLY those columns (stock pairs) that have a positive return on the last Date.
I started with:
n = list()
for i in range(len(df.columns)):
    if df.iloc[-1, i] > 0:
        n.append(i)
Output: [3, 11, 12, 22, 23, 25, 27, 28, 30]
Now, the final step is to create a subset dataframe of 'df' containing only the columns corresponding to the numbers in this list. This is where I have problems. Do you have any idea? Thanks
Does this solve your problem?
n = []
for i, col in enumerate(df.columns):
    if df.iloc[-1, i] > 0:
        n.append(col)
df[n]
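The loop can also be collapsed into a single boolean selection over the columns; a minimal sketch of the same idea:

# df.iloc[-1] is the last row as a Series indexed by column name,
# so the comparison yields a boolean mask usable as a column selector.
df_positive = df.loc[:, df.iloc[-1] > 0]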
Here you are ;)
sample df1:
              a    b    c
date
2017-04-01  0.5 -0.7 -0.6
2017-04-02  1.0  1.0  1.3
df1.loc[df1.index.astype(str) == '2017-04-02', df1.ge(1.2).any()]
              c
date
2017-04-02  1.3
The logic will be the same for your case as well.
If I understand correctly, you want columns with IDs [3, 11, 12, 22, 23, 25, 27, 28, 30], am I right?
You should use DataFrame.iloc:
column_ids = [3, 11, 12, 22, 23, 25, 27, 28, 30]
df_subset = df.iloc[:, column_ids].copy()
The ":" on the left side of df.iloc means "all rows". I suggest using copy method in case you want to perform additional operations on df_subset without the risk to affect the original df, or raising Warnings.
If instead of a list of column IDs, you have a list of column names, you should just replace .iloc with .loc.
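For completeness, the same selection by label, with hypothetical column names standing in for the real stock-pair headers:

# Hypothetical names; .loc selects columns by label rather than position.
column_names = ['pair_03', 'pair_11', 'pair_12']
df_subset = df.loc[:, column_names].copy()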

Filtering rows and appending values

I want to filter a dataframe and append a value to a new/existing column in the dataframe. For example, in the following dataframe, I want to append a value of 0.7 to the column pre_mean where the month values are equal to 11. In other words, the pre_mean column should contain values of 0.7 at rows 2 to 5, while all other rows should have a NaN value.
I tried something like this, but of course it's incorrect.
df[:pre_mean] = ifelse.(df[:month] .== 11, 0.7, df)
In Python, you can do this using the pd.apply or np.where functions:
# How to do it in Python
df["pre_mean"] = np.where(df["month"] == 11, 0.7, None)
But I have got no clue how to achieve this in Julia? Any Ideas?
df[df.month .== 11, :pre_mean] .= 0.7
This should work.
Also, the code in the question is almost correct:
df.pre_mean .= ifelse.(df.month .== 11, 0.7, df.pre_mean)
which should be faster than the solution proposed by @Andy_101 (which is also correct) as it does not allocate.
As a side note, observe that df[:pre_mean] is not allowed in DataFrames.jl. A data frame is a 2-dimensional object, so you have to pass both a row and a column selector (unless you use the getproperty method, as in @Andy_101's answer and mine).

Selecting dataframe column that is a mix of string and number in groupby operation

How do I select the column name in the groupby operation, since the column name has integers mixed with strings? I cannot rename the column since I need to keep track of it.
import pandas as pd
Signal_val = {'Signal-230_115': [20, 7, 8, 17, 19, 19, 7, 17, 17, 17],
              'above': [True, False, False, True, True, True, False, True, True, True]}
df = pd.DataFrame(data=Signal_val)
groups = df.groupby(['above']).Signal-230_115.size()
I am getting an invalid syntax error in this case. Is there a way I can select the specified signal column in the groupby operation without explicitly mentioning the Signal name?
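For what it's worth, attribute access (.Signal-230_115) cannot parse a name containing a hyphen, which is why the line above is a syntax error; bracket indexing works for any column name. A minimal sketch, reusing the question's df:

# Bracket notation accepts hyphenated column names.
groups = df.groupby('above')['Signal-230_115'].size()

# Or pick the column positionally, without typing its name at all:
signal_col = df.columns[0]
groups = df.groupby('above')[signal_col].size()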

reindex group to add missing rows

I am trying to reindex groups to extend dataframes with missing values. Similar to how resample works for time indexes, I am trying to achieve this for normal integer values.
So, for a group belonging to a certain group key (proID in my case), the maximum existing integer value shall be determined (specifying the end point of the resampling process). The group shall be extended (I was trying to achieve this with reindex) by the missing values up to this integer value.
I have a dataframe with many rows per proID, an integer bin value which can range from 0 to 100, and some meaningless columns. Basically, the bin values shall be filled in if some data are missing, similarly to what resample would do for time indexes.
def rsmpint(df):
    mx = df.bin.max()  # identify maximal existing bin value in dataframe (group)
    no = (mx * 20 / 100).astype(np.int64) + 1  # calculate number of bin values
    idx = pd.Index(np.linspace(0, mx, no), name='bin')  # define full bin index for df (group)
    df.set_index('bin').reindex(idx).ffill().reset_index(drop=True, inplace=True)
    return df
DF.groupby('proID').apply(rsmpint)
Let's assume that for a specific proID there are currently 5 bin values, [0, 15, 20, 40, 65] (i.e. 5 rows in the original proID group). The output shall be an extended proID group with bin values [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65], with the content of the "meaningless" columns filled using ffill().
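As written, rsmpint discards the reindexed frame: reset_index(..., inplace=True) returns None, the chained result is never assigned, and the original group comes back unchanged. A minimal working sketch of the same idea, assuming (as the 20/100 factor implies) that the bins sit on a grid with step 5:

import numpy as np
import pandas as pd

def rsmpint(df):
    mx = df['bin'].max()
    # full bin grid from 0 to mx in steps of 5
    idx = pd.Index(np.arange(0, mx + 1, 5), name='bin')
    # assign the result instead of mutating in place, then keep 'bin' as a column
    return df.set_index('bin').reindex(idx).ffill().reset_index()

out = DF.groupby('proID', group_keys=False).apply(rsmpint)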