Why am I returned an object when using std() in Pandas?

The printed averages of the spreads come out grouped and calculated correctly. Why do I get the following for the std_deviation column instead of the standard deviation of the spread grouped by ticker?
<pandas.core.groupby.SeriesGroupBy object at 0x000000000484A588>
import numpy as np
import pandas as pd

df = pd.read_csv('C:\\Users\\William\\Desktop\\tickdata.csv',
                 dtype={'ticker': str, 'bidPrice': np.float64, 'askPrice': np.float64, 'afterHours': str},
                 usecols=['ticker', 'bidPrice', 'askPrice', 'afterHours'],
                 nrows=3000000
                 )
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
df['std_deviation'] = df['spread'].std(ddof=0)
df = df.groupby(['ticker'])
print(df['std_deviation'])
print(df['spread'].mean())
UPDATE: I no longer get an object back, but I am now trying to work out how to display the standard deviation by ticker:
df['spread'] = (df.askPrice - df.bidPrice)
df2 = df.groupby(['ticker'])
print(df2['spread'].mean())
df = df.set_index('ticker')
print(df['spread'].std(ddof=0))
UPDATE 2: I got the result I needed using:
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
print(df.groupby(['ticker'])['spread'].mean())
print(df.groupby(['ticker'])['spread'].std(ddof=0))

This line:
df = df.groupby(['ticker'])
assigns df to a DataFrameGroupBy object, and
df['std_deviation']
is a SeriesGroupBy object (of the column).
It's a good idea not to "shadow" a variable by re-assigning it to a completely different datatype. Try using a different variable name for the groupby!
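As a minimal sketch of that advice, the snippet below keeps the DataFrame in df and stores the groupby under a separate name. The sample tickers and prices are made up for illustration, and the name grouped is just one possible choice.

```python
import pandas as pd

# Made-up sample data standing in for the tick data in the question
df = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "bidPrice": [10.0, 10.0, 20.0, 20.0],
    "askPrice": [10.5, 11.5, 20.25, 20.75],
})
df["spread"] = df["askPrice"] - df["bidPrice"]

# Keep df as a DataFrame; give the groupby its own name
grouped = df.groupby("ticker")

# One value per ticker
spread_mean = grouped["spread"].mean()
spread_std = grouped["spread"].std(ddof=0)

# To store the per-ticker value on every row, transform broadcasts it back
df["std_deviation"] = grouped["spread"].transform(lambda s: s.std(ddof=0))

print(spread_mean)
print(spread_std)
```

Calling an aggregation such as mean() or std() on the SeriesGroupBy is what turns it into actual numbers; printing the SeriesGroupBy itself only shows the object.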

Related

groupby with transform minmax

For every city, I want to create a new column that is the min-max scaling of another column (age).
I tried this and got: Input contains infinity or a value too large for dtype('float64').
cols = ['age']
def f(x):
    scaler1 = preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x
df = df.groupby(['city']).apply(f)
From the comments:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)
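A minimal sketch of the whole flow, assuming plain pandas arithmetic is an acceptable substitute for MinMaxScaler: infinities are replaced first, then each city's ages are scaled with groupby and transform. The sample data and the helper name minmax are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data with one infinite value
df = pd.DataFrame({
    "city": ["X", "X", "X", "Y", "Y"],
    "age": [20.0, 30.0, np.inf, 40.0, 60.0],
})

# Replace infinities with NaN so min/max ignore them
df["age"] = df["age"].replace([np.inf, -np.inf], np.nan)

def minmax(s):
    """Scale one group's values to [0, 1]; a constant group maps to 0."""
    span = s.max() - s.min()
    return (s - s.min()) / span if span else s * 0.0

# transform returns a result aligned with the original rows
df["age_minmax"] = df.groupby("city")["age"].transform(minmax)
print(df)
```

Rows that held an infinite age end up as NaN in age_minmax rather than raising an error.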

Pandas Data frame column condition check based on length of the value

I have a pandas DataFrame that is created by reading an Excel file. The Excel file has a column called Serial Number. I then pass each serial number to another function, which connects to an API and fetches the result set for that serial number.
My code:
def create_excel(filename):
    try:
        data = pd.read_excel(filename, usecols=[4,18,19,20,26,27,28],converters={'Serial Number': '{:0>32}'.format})
    except Exception as e:
        sys.exit("Error reading %s: %s" % (filename, e))
    data["Subject Organization"].fillna("N/A",inplace= True)
    df = data[data['Subject Organization'].str.contains("Fannie",case = False)]
    #df['Serial Number'].apply(lambda x: '000'+x if len(x) == 29 else '00'+x if len(x) == 30 else '0'+x if len(x) == 31 else x)
    print(df)
    df.to_excel(r'Data.xlsx',index= False)
    output = df['Serial Number'].apply(lambda x: fetch_by_ser_no(x))
    df2 = pd.DataFrame(output)
    df2.columns = ['Output']
    df5 = pd.concat([df,df2],axis = 1)
The problem I am facing: if the output returned by fetch_by_ser_no() (the Output column in df5) is blank, I want to pad the serial number to 34 characters by adding two more leading zeros and then call the function again.
How can I do this without creating multiple DataFrames?
Any help is appreciated!
Thanks
You can try to use if ... else ...:
output = df['Serial Number'].apply(lambda x: 'ok' if fetch_by_ser_no(x) else 'badly')
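One way to fold the retry into a single pass, as a sketch only: wrap the lookup in a helper that tries the padded serial number when the first call comes back blank. Here fetch_by_ser_no is a hypothetical stand-in for the real API call, and fetch_with_retry is an illustrative name, not part of the original code.

```python
import pandas as pd

def fetch_by_ser_no(serial):
    # Hypothetical stand-in for the API call in the question:
    # returns an empty string when nothing is found.
    known = {"00" + "7" * 32: "record-A", "1" * 32: "record-B"}
    return known.get(serial, "")

def fetch_with_retry(serial):
    """Try the serial as given; if the result is blank, retry with two extra leading zeros."""
    result = fetch_by_ser_no(serial)
    if not result:
        result = fetch_by_ser_no("00" + serial)
    return result

df = pd.DataFrame({"Serial Number": ["7" * 32, "1" * 32, "9" * 32]})

# Assign the result as a new column instead of building df2 and concatenating
df["Output"] = df["Serial Number"].apply(fetch_with_retry)
print(df)
```

Assigning the result directly as a column avoids the intermediate df2 and df5 frames.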

Function giving error when run on the same dataframe more than once

It works fine the first time, but when run again on the same df it gives me this error:
IndexError: single positional indexer is out-of-bounds
def update_data(df):
    df.drop(df.columns[[-1, -2, -3]], axis=1, inplace=True)
    df.loc['Total'] = df.sum()
    df.iloc[-1, 0] = 'Group'
    df = df.set_index(list(df)[0])
    for i in range(1, 21):
        df.iloc[-1, i] = 100 + (100 * (
            (df.iloc[-1, i] - df.iloc[-1, 0]) / abs(df.iloc[-1, 0])))
    df.iloc[-1, 0] = 100
    xax = list(df.columns.values)
    yax = df.values[-1].tolist()
    d = {'period': xax, 'level': yax}
    index_level = pd.DataFrame(d)
    index_level['level'] = index_level['level'].round(3)
    return index_level
Using inplace=True inside a function modifies the caller's DataFrame. Your function assumes the data is in a particular format when it starts; after the first call that assumption no longer holds, so the second call fails. For example:
df = pd.DataFrame([{'x': 0}])
def change(df):
    df.drop(columns=['x'], inplace=True)
    return len(df)
change(df)
Out[346]: 1
df
Out[347]:
Empty DataFrame
Columns: []
Index: [0]
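For contrast, a minimal sketch of a version that leaves the input untouched: drop without inplace=True returns a new DataFrame, so the function can be called repeatedly on the same input. The name change_safely is made up for illustration.

```python
import pandas as pd

def change_safely(df):
    # drop() without inplace=True returns a new frame; the caller's df is untouched
    trimmed = df.drop(columns=["x"])
    return len(trimmed)

df = pd.DataFrame([{"x": 0}])
first = change_safely(df)
second = change_safely(df)  # works again because df still has column "x"
print(first, second, list(df.columns))
```

The same idea applies to update_data: either avoid inplace operations or start the function with a copy of the input.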

Pandas DataFrame expand existing dataset to finer timestamp

I am trying to make this piece of code faster; it fails when converting ~120K rows to ~1.7M.
Essentially, I am trying to expand each date-stamped entry into 14 rows, one for each day from PayPeriodEndingDate back to T-14.
Does anyone have a better suggestion than itertuples for this loop?
Thanks!
df_Final = pd.DataFrame()
for row in merge4.itertuples():
    listX = []
    listX.append(row)
    df = pd.DataFrame(listX*14)
    df = df.reset_index().drop('Index',axis=1)
    df['Hours'] = df['Hours']/14
    df['AmountPaid'] = df['AmountPaid']/14
    df['PayPeriodEnding'] = np.arange(df.loc[:,'PayPeriodEnding'][0] - np.timedelta64(14,'D'), df.loc[:,'PayPeriodEnding'][0], dtype='datetime64[D]')
    frames = [df_Final,df]
    df_Final = pd.concat(frames,axis=0)
df_Final
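One possible vectorized alternative, offered as a sketch rather than a tested drop-in: repeat every row 14 times in one step, divide the amounts, and shift the dates with a tiled array of day offsets. It assumes merge4 has a unique index and a datetime PayPeriodEnding column; the sample rows and the Employee column are made up.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for merge4
merge4 = pd.DataFrame({
    "Employee": ["A", "B"],
    "Hours": [70.0, 84.0],
    "AmountPaid": [1400.0, 2100.0],
    "PayPeriodEnding": pd.to_datetime(["2020-01-15", "2020-01-29"]),
})

n_days = 14

# Repeat each row n_days times in a single operation (no Python-level loop)
expanded = merge4.loc[merge4.index.repeat(n_days)].reset_index(drop=True)

# Spread the totals evenly across the days
expanded[["Hours", "AmountPaid"]] = expanded[["Hours", "AmountPaid"]] / n_days

# Offsets -14, -13, ..., -1 for every original row, matching the np.arange in the loop
offsets = np.tile(np.arange(-n_days, 0), len(merge4))
expanded["PayPeriodEnding"] = expanded["PayPeriodEnding"] + pd.to_timedelta(offsets, unit="D")

print(expanded.head(15))
```

Much of the cost in the original loop comes from calling pd.concat once per row, which copies the growing result each time; building the expanded frame in one step avoids that.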

pandas: map color argument by multidict

I would like to map a color to each row in the dataframe as a function of two columns. It would be much easier with just one column as argument. But how can I achieve this with two columns ?
What I have done so far:
a = np.random.rand(3,10)
i = [[30,10], [10, 30], [60, 60]]
names = ['a', 'b']
index = pd.MultiIndex.from_tuples(i, names = names)
df = pd.DataFrame(a, index=index).reset_index()
c1 = plt.cm.Greens(np.linspace(0.2,0.8,3))
c2 = plt.cm.Blues(np.linspace(0.2,0.8,3))
#c3 = plt.cm.Reds(np.linspace(0.2,0.8,3))
color = np.vstack((c1,c2))
a = df.a.sort_values().values
b = df.b.sort_values().values
mapping = dict()
for i in range(len(a)):
    mapping[a[i]] = {}
    for ii in range(len(b)):
        mapping[a[i]][b[ii]] = color[i+ii]
Maybe something similar to df['color'] = df.apply(lambda x: mapping[x.a][x.b])?
Looks like you answered your own question. apply can work across the rows if you set the axis argument to 1:
df['color'] = df.apply(lambda x: mapping[x.a][x.b], axis=1)
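A small self-contained sketch of the row-wise lookup, using made-up color names in place of the matplotlib colormap values so it runs without plotting libraries:

```python
import pandas as pd

df = pd.DataFrame({"a": [30, 10, 60], "b": [10, 30, 60]})

# Nested dict keyed first by column a, then by column b (illustrative colors)
mapping = {
    10: {30: "green"},
    30: {10: "blue"},
    60: {60: "red"},
}

# axis=1 passes each row to the lambda, so both columns are available
df["color"] = df.apply(lambda row: mapping[row["a"]][row["b"]], axis=1)
print(df)
```

Without axis=1, apply passes whole columns to the function, so the lookup by row values fails.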