I have a dataframe and I want to find the standard deviation for some specific cells - pandas

I'm trying to use pandas to find the standard deviation of the entries in some specific cells.
I have tried using NumPy's std like so:
numpy.std(df[columnName][j:i])
I have also tried this:
df.std(axis=0)[columnName][j:i]
This is just pseudocode, because my actual code is more confusing than necessary for this question:
df = loadIris()
for feat in df.columns:
    i = 0
    j = 0
    flower = df['flower'][i]
    while i < df.index.max():
        if df['flower'][i] == flower:
            i+=1
        else:
            j = i
            stand = df.std(axis=0)[feat][j:i]
            flower = df['flower'][i]

I ended up appending all of the values to a list and then calculating the standard deviation with statistics.stdev, which is available after importing the statistics module.
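For comparison, here is a minimal sketch of doing the same thing in pandas directly. The small DataFrame and its column names are made up to stand in for the iris data. Note that pandas' .std() and statistics.stdev both compute the sample standard deviation (ddof=1), while numpy.std defaults to the population standard deviation (ddof=0), so the two can give different numbers on the same cells.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris data: one label column, two features.
df = pd.DataFrame({
    "flower": ["setosa", "setosa", "setosa", "virginica", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 4.7, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.4, 1.3, 6.0, 5.1, 5.9],
})

# Standard deviation of a slice of one column (rows j up to, not including, i).
j, i = 0, 3
cells = df["sepal_length"].iloc[j:i]
slice_std = cells.std()                    # sample std (ddof=1)
slice_std_np = np.std(cells.to_numpy())    # population std (ddof=0)

# Standard deviation of every feature within each flower group, without a loop.
group_std = df.groupby("flower").std()
```

If the goal is one value per flower and feature, the groupby form probably replaces the while loop entirely.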

Related

Concatenate rows in Pandas

I have sales data for 12 months, one file per month, and I want to analyze the dataset as a whole.
I have tried using the concat function, but it produces NaN (not a number) values in my dataframe fields.
In R, the cbind function solves this. How do I approach this in Python?
I tried using the pd.concat function to bind the rows, because all the column names are the same across the datasets.
What other options can I explore?
sales_1 = pd.read_csv('Sales_January_2019.csv')
sales_2 = pd.read_csv('Sales_February_2019.csv')
sales_3 = pd.read_csv('Sales_March_2019.csv')
sales_4 = pd.read_csv('Sales_April_2019.csv')
sales_5 = pd.read_csv('Sales_May_2019.csv')
sales_6 = pd.read_csv('Sales_June_2019.csv')
sales_7 = pd.read_csv('Sales_July_2019.csv')
sales_8 = pd.read_csv('Sales_August_2019.csv')
sales_9 = pd.read_csv('Sales_September_2019.csv')
sales_10 = pd.read_csv('Sales_October_2019.csv')
sales_11 = pd.read_csv('Sales_November_2019.csv')
sales_12 = pd.read_csv('Sales_December_2019.csv')
I expect all the data frames to be merged into one, since the column names are the same for all of them.
Perhaps:
# use concat with the list of DataFrames you have already read in to combine them into a single DataFrame
pd.concat([sales_1, sales_2, sales_3, sales_4, sales_5, sales_6, sales_7, sales_8, sales_9, sales_10, sales_11, sales_12])
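A slightly fuller sketch of the same idea, assuming the monthly files share identical column names. The two tiny CSV files and their columns (Product, Quantity) are invented here only so the example runs on its own. pd.concat aligns on column names when stacking rows, so NaN values usually mean the names differ between files (for example stray whitespace) or that the frames were joined side by side with axis=1.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical setup: write two tiny monthly files so the sketch is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"Product": ["A", "B"], "Quantity": [1, 2]}).to_csv(
    tmp_dir / "Sales_January_2019.csv", index=False)
pd.DataFrame({"Product": ["C"], "Quantity": [3]}).to_csv(
    tmp_dir / "Sales_February_2019.csv", index=False)

# Read every monthly file and stack the rows.
# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's index.
files = sorted(tmp_dir.glob("Sales_*_2019.csv"))
frames = [pd.read_csv(f) for f in files]
sales = pd.concat(frames, axis=0, ignore_index=True)
```

Reading the files in a list comprehension also avoids keeping twelve separately named variables.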

How to calculate rolling.agg('max') utilising a dataframe column as input to my function

I'm working with a kline dataframe and adding Swing_High and Swing_Low columns to my df.
I've run into an error: during low-volatility periods my Close equals the Swing_Low price. This gives me an inf error in another function of mine that computes close / Swing_Low.
To fix this, I need to calculate the max/min value based on whether Close == Swing_Low. The default rolling period is 10, but if the condition is true the rolling period should increase to 15.
Below is how I calculated Swing_High and Swing_Low before encountering the inf error.
import pandas as pd
df = pd.read_csv('Data/bybit_BTCUSD_15m.csv')
df["Date"] = df["Date"].astype('datetime64[ns]')
# Calculate the swing high and low for a given length
df['Swing_High'] = df['High'].rolling(10).agg('max')
df['Swing_Low'] = df['Low'].rolling(10).agg('min')
I tried the function below, but it raises ValueError: The truth value of a Series is ambiguous
def swing_high(close, high, period1, period2):
    a = high.rolling(period1).agg('max')
    b = high.rolling(period2).agg('max')
    if a != close:
        return a
    else:
        return b
df['Swing_High'] = swing_high(df['Close'], df['High'], 10, 15)
How do I fix this or is there a better way to achieve my desired outcome?
A simple solution for what you're trying to achieve is the pandas where() function.
where() keeps the original value wherever the condition is True and substitutes the second argument wherever it is False.
Here's the basic syntax:
df['col'] = (value_if_true).where(condition, value_if_false)
df['Swing_High_10'] = df['High'].rolling(10).agg('max')
df['Swing_High_15'] = df['High'].rolling(15).agg('max')
df['Swing_High'] = df['Swing_High_10'].where(df['Swing_High_10'] != df['Close'], df['Swing_High_15'])
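To see the element-wise selection at work, here is a small sketch with made-up prices and shortened windows (2 and 3 instead of 10 and 15) so that a six-row frame has non-NaN results. The short-window series is the one calling where(), so it is kept wherever it differs from Close, and the long-window series is used everywhere else.

```python
import pandas as pd

# Hypothetical prices; real data would come from the kline CSV.
df = pd.DataFrame({
    "High":  [10, 14, 11, 12, 13, 11],
    "Close": [ 9, 13, 10, 12, 12, 10],
})

short_n, long_n = 2, 3  # small windows chosen only for this tiny example
high_short = df["High"].rolling(short_n).max()
high_long = df["High"].rolling(long_n).max()

# Keep the short-window high unless it equals Close; in that case use the long-window high.
df["Swing_High"] = high_short.where(high_short != df["Close"], high_long)
```

In row 3 the short-window high (12) equals Close, so the long-window high (14) is used instead. Rows where the rolling window is not yet full stay NaN.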

How to use Pandas vector methods based on rolling custom function that involves entire row and prior data

While it's easy to use the pandas rolling method to apply standard formulas, I find it hard when the calculation involves multiple columns and a limited number of past rows. The following code illustrates the problem:
import numpy as np
import pandas as pd
# create dummy dataframe
df = pd.DataFrame({'col1': np.arange(0, 25), 'col2': np.arange(100, 125), 'col3': np.nan})
def func1(shortdf):
    # dummy formula:
    # add the last row of col1 to the sum of col2, then scale the result
    return (shortdf.col1.tail(1).values[0] + shortdf.col2.sum()) * 3.14
for idx, i in df.iterrows():
    if idx > 3:
        # only interested in the 3 rows before the current position in the dataframe
        df.loc[idx, 'col3'] = func1(df.iloc[idx-3:idx])
I currently use this iterrows approach, which, needless to say, is extremely slow. Does anyone have a better suggestion?
Option 1
shift is the solution here. Use rolling for the summation, do the addition and multiplication, and then shift the resulting series by one row so that each row only sees the rows before it.
df = pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})
ans = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
You can check that ans matches df['col3'] with ans.eq(df['col3']). Once you see that all rows but the first few match, assign ans to df['col3'] and you should be all set.
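The check can be sketched as a self-contained script. It rebuilds the loop result from the question as a reference and compares it with the vectorized series using np.isclose rather than exact equality, on the assumption that floating-point rounding could differ slightly between the two paths.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": np.arange(0, 25), "col2": np.arange(100, 125), "col3": np.nan})

# Loop version from the question, used here only as the reference result.
for idx in range(len(df)):
    if idx > 3:
        window = df.iloc[idx - 3:idx]
        df.loc[idx, "col3"] = (window["col1"].iloc[-1] + window["col2"].sum()) * 3.14

# Vectorized version from Option 1.
ans = ((df["col1"] + df["col2"].rolling(3).sum()) * 3.14).shift(1)

# Compare only the rows the loop filled in (idx > 3), with a float tolerance.
filled = df["col3"].notna()
matches = bool(np.isclose(ans[filled], df.loc[filled, "col3"]).all())
```

The vectorized series also has a value at row 3, which the loop skips because of its idx > 3 condition, so the two differ there by design.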
Option 2
Without additional information about the customized weight function, it is hard to help. However, this option may be a solution as it separates the rolling calculation at the cost of using more memory.
# df['col3'] = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
s = df['col2']
stride = pd.DataFrame([s.shift(x).values[::-1][:3] for x in range(len(s))[::-1]])
res = pd.concat([df, stride], axis=1)
# here you can perform your custom weight function
res['final'] = ((res[0] + res[1] + res[2] + res['col1']) * 3.14).shift(1)
stride is adapted from another question, and it is concatenated column-wise (axis=1) to the original dataframe. This way each row holds the values needed to compute whatever it is you may need.
res['final'] is identical to Option 1's ans.

Filtering out outliers in Pandas dataframe with rolling median

I am trying to filter out some outliers from a scatter plot of GPS elevation displacements against dates.
I'm trying to use df.rolling to compute a median and standard deviation for each window and then remove a point if it is more than 3 standard deviations from the median.
However, I can't figure out a way to loop through the column and compare each value to the rolling median.
Here is the code I have so far:
import pandas as pd
import numpy as np
def median_filter(df, window):
    cnt = 0
    median = df['b'].rolling(window).median()
    std = df['b'].rolling(window).std()
    for row in df.b:
        # compare each value to its median
        pass  # comparison not written yet
df = pd.DataFrame(np.random.randint(0,100,size=(100,2)), columns = ['a', 'b'])
median_filter(df, 10)
How can I loop through and compare each point and remove it?
Just filter the dataframe with a boolean mask:
df['median']= df['b'].rolling(window).median()
df['std'] = df['b'].rolling(window).std()
#filter setup
df = df[(df.b <= df['median']+3*df['std']) & (df.b >= df['median']-3*df['std'])]
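Wrapped into a function, the same filter might look like the sketch below. The seed and the num_std parameter name are choices made for this example. One thing to be aware of: the first window - 1 rows have no rolling statistics (NaN), so the comparison is False for them and they are dropped along with the outliers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
df = pd.DataFrame(rng.integers(0, 100, size=(100, 2)), columns=["a", "b"])

def median_filter(df, window, num_std=3):
    """Return rows whose 'b' lies within num_std rolling stds of the rolling median."""
    median = df["b"].rolling(window).median()
    std = df["b"].rolling(window).std()
    # Comparisons against NaN are False, so rows without a full window are removed.
    keep = (df["b"] - median).abs() <= num_std * std
    return df[keep]

filtered = median_filter(df, window=10)
```

If the leading rows should be kept, passing min_periods to rolling() is one option, though the early statistics are then based on fewer points.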
There might well be a more pandas-idiomatic way to do this. This is a bit of a hack that relies on manually mapping the original df's index to each rolling window (I picked a window size of 6). The records up to and including row 6 are associated with the first window, row 7 with the second window, and so on.
n = 100
df = pd.DataFrame(np.random.randint(0,n,size=(n,2)), columns = ['a','b'])
## set window size
window=6
std = 1 # I set it at just 1; with real data and larger windows, can be larger
## create df with rolling stats, upper and lower bounds
bounds = pd.DataFrame({'median': df['b'].rolling(window).median(),
                       'std': df['b'].rolling(window).std()})
bounds['upper']=bounds['median']+bounds['std']*std
bounds['lower']=bounds['median']-bounds['std']*std
## here, we set an identifier for each window which maps to the original df
## the first six rows are the first window; then each additional row is a new window
bounds['window_id']=np.append(np.zeros(window),np.arange(1,n-window+1))
## then we can assign the original 'b' value back to the bounds df
bounds['b']=df['b']
## and finally, keep only rows where b falls within the desired bounds
bounds.loc[bounds.eval("lower<b<upper")]
This is my take on creating a median filter:
def median_filter(num_std=3):
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)
        s = x[-1]
        return s if s >= _median - num_std * _std and s <= _median + num_std * _std else np.nan
    return _median_filter
df.y.rolling(window).apply(median_filter(num_std=3), raw=True)

Generate programmed data & create data frame out of it (generated data in to single column)

Initially, I was using a for loop to generate some random data (using the logic shown below) and writing each value to a key ('server_avg_response_time') in a dictionary ('data_dict'). The dictionaries are collected in a list ('data_rows'), and the whole list is then written to CSV.
Code snippet for generating random data:
import random
server_avg_response_time_alert = "low"
for i in range(0, no_of_rows):
    if (random.randint(0, 10) != 7 and server_avg_response_time_alert != "high"):
        data_dict['server_avg_response_time'] = random.randint(1, 800000)
    else:
        if (server_avg_response_time_alert == "low"):
            print("***ALERT***")
            server_avg_response_time_alert = "high"
        data_dict['server_avg_response_time'] = random.randint(600000, 800000)
        server_avg_response_time_period = random.randint(1, 1000)
        if (server_avg_response_time_period > 980):
            print("***ALERT OFF***")
            server_avg_response_time_alert = "low"
    data_rows.insert(i, data_dict.copy())
This takes a lot of time (to generate some 300,000 rows of data), so I was asked to look at pandas to generate the data faster. Now I am trying to apply the same logic to a pandas dataframe.
Question: If I put the above code in a function, can I use that function to generate data directly into a column of a dataframe? What is the best way to program this data into a dataframe column? I believe I don't need a dictionary (key) either if I put the data directly into the dataframe after generating it randomly, but I don't know the syntax to do it.
Try wrapping your logic (everything inside the for loop) in a function, then apply it to a zero-filled pandas df with one column called 'server_avg_response_time' (with 300000 rows) using the apply method, like this:
def randomLogic(value):
    random_value = 0 # logic goes here
    return random_value
df = pd.DataFrame(np.zeros(300000), columns=['server_avg_response_time'])
df['server_avg_response_time'] = df.server_avg_response_time.apply(randomLogic)
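Because the alert logic carries state from one row to the next, a per-element apply needs somewhere to keep that state. One alternative, sketched below under the assumption that only the single column is needed, is to generate the values in a plain function and hand the finished list to the DataFrame constructor. The function name generate_response_times and the seed parameter are made up for this example, and the print calls from the question are left out.

```python
import random

import pandas as pd

def generate_response_times(no_of_rows, seed=None):
    """Generate response times with the low/high alert logic from the question."""
    rng = random.Random(seed)
    alert = "low"
    values = []
    for _ in range(no_of_rows):
        if rng.randint(0, 10) != 7 and alert != "high":
            values.append(rng.randint(1, 800000))
        else:
            alert = "high"  # enter (or stay in) the alert state
            values.append(rng.randint(600000, 800000))
            if rng.randint(1, 1000) > 980:
                alert = "low"  # leave the alert state
    return values

df = pd.DataFrame({"server_avg_response_time": generate_response_times(300000, seed=42)})
```

This avoids the per-row dictionary copy and list insert, which is likely where much of the original time went, though that has not been measured here. The result can be written out with df.to_csv.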