Concatenate DataFrames.DataFrame in Julia - dataframe

I have a problem when I try to concatenate multiple DataFrames (a datastructure from the DataFrames package!) with the same columns but different row numbers. Here's my code:
using(DataFrames)
DF = DataFrame()
DF[:x1] = 1:1000
DF[:x2] = rand(1000)
DF[:time] = append!( [0] , cumsum( diff(DF[:x1]).<0 ) ) + 1
DF1 = DF[DF[:time] .==1,:]
DF2 = DF[DF[:time] .==round(maximum(DF[:time])),:]
DF3 = DF[DF[:time] .==round(maximum(DF[:time])/4),:]
DF4 = DF[DF[:time] .==round(maximum(DF[:time])/2),:]
DF1[:T] = "initial"
DF2[:T] = "final"
DF3[:T] = "1/4"
DF4[:T] = "1/2"
DF = [DF1;DF2;DF3;DF4]
The last line gives me the error
MethodError: Cannot `convert` an object of type DataFrames.DataFrame to an object of type LastMain.LastMain.LastMain.DataFrames.AbstractDataFrame
This may have arisen from a call to the constructor LastMain.LastMain.LastMain.DataFrames.AbstractDataFrame(...),
since type constructors fall back to convert methods.
I don't understand this error message. Can you help me out? Thanks!

I just ran into this exact problem on Julia 0.5.0 x86_64-linux-gnu, DataFrames 0.8.5, with both hcat and vcat.
Neither clearing the workspace nor reloading DataFrames solved the problem, but restarting the REPL fixed it immediately.

Related

How to resolve :first argument must be an iterable of pandas objects, you passed an object of type "DataFrame" in Pandas

I am getting this error:
first argument must be an iterable of pandas objects, you passed an object of type "DataFrame".
My code:
for f in
glob.glob("C:/Users/panksain/Desktop/aovaNALYSIS/CX AOV/Report*.csv"):
data = pd.concat(pd.read_csv(f,header = None, names = ("Metric Period", "")), axis=0, ignore_index=True)
concat takes a list of dataframes to concat with. You can first build the list and then do the concat at last:
dfs = []
for f in glob.glob("C:/Users/panksain/Desktop/aov aNALYSIS/CX AOV/Report*.csv"):
dfs.append(pd.read_csv(f,header = None, names = ("Metric Period", "")))
data = pd.concat(dfs, axis=0, ignore_index=True)

KeyError to existing Colums in my panda dataframe after transposing, with dtype='object'

So i have some trouble with trying to make basic arithmetic with columns in pandas. the datatype of my columns after transposing are 'object'. Because of this i get a KeyError when trying to add a column to another column.
After transposing I have added the code:
dg.columns = dg.columns.astype(str)
The dont seem to react to it, anyone knows how to solve this?
my full code:
dg = pd.read_csv("file.csv", encoding="latin-1", header = None)
dg = dg.T
dg.columns= dg.iloc[0]
dg = dg.reindex(dg.index.drop(0))
dg.index.name = 'Date'
dg = dg.fillna(0)
dg.drop(dg.columns.difference(['Category','revenue','Result']), 1, inplace=True)
dg.columns = dg.columns.astype(str)
print (dg.columns)
dg['revenue','Result'] = pd.to_numeric(dg['revenue','Result'], errors='coerce')
dg['cost'] = dg['revenue']* - dg['Result']
dg = dg.groupby('Category','revenue','Result','cost').agg(sum).reset_index()
print (dg[:5])

Problems appending rows to DataFrame. ZMQ messages to Pandas Dataframe

I am taking messages for market data from a ZMQ subscription and turning it into a pandas dataframe.
I tried creating a empty dataframe and appending rows to it. It did not work out. I keep getting this error.
RuntimeWarning: '<' not supported between instances of 'str' and 'int', sort
order is undefined for incomparable objects
result = result.union(other)
Im guessing this is because Im appending a list of strings to a dataframe. I clear the list then try to append the next row. The data is 9 rows. First one is a string and the other 8 are all floats.
list_heartbeat = []
list_fills= []
market_data_bb = []
market_data_fs = []
abacus_fs = []
abacus_bb =[]
df_bar_data_bb = pd.DataFrame(columns= ['Ticker','Start_Time_Intervl','Interval_Length','Current_Open_Price',
'Previous_Open','Previous_Low','Previous_High','Previous_Close','Message_ID'])
def main():
context = zmq.Context()
socket_sub1 = context.socket(zmq.SUB)
socket_sub2 = context.socket(zmq.SUB)
socket_sub3 = context.socket(zmq.SUB)
print('Opening Socket...')
# We can connect to several endpoints if we desire, and receive from all.
print('Connecting to Nicks BroadCast...')
socket_sub1.connect("Server:port")
socket_sub2.connect("Server:port")
socket_sub3.connect("Server:port")
print('Connected To Nicks BroadCast... Waiting For Messages.')
print('Connected To Jasons Two BroadCasts... Waiting for Messages.')
#socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'H')
socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'R')
#socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'HEARTBEAT') #possible heartbeat from Jason
socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'BAR_FS')
socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'HEARTBEAT')
socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'BAR_BB')
socket_sub3.setsockopt_string(zmq.SUBSCRIBE, 'ABA_FS')
socket_sub3.setsockopt_string(zmq.SUBSCRIBE, 'ABA_BB')
poller = zmq.Poller()
poller.register(socket_sub1, zmq.POLLIN)
poller.register(socket_sub2, zmq.POLLIN)
poller.register(socket_sub3, zmq.POLLIN)
while (running):
try:
socks = dict(poller.poll())
except KeyboardInterrupt:
break
#check if the message is in socks, if so then save to message1-3 for future use.
#Msg1 = heartbeat for Nicks server
#Msg2 = fills
#msg3 Mrkt Data split between FS and BB
#msg4
if socket_sub1 in socks:
message1 = socket_sub1.recv_string()
list_heartbeat.append(message1.split())
if socket_sub2 in socks:
message2 = socket_sub2.recv_string()
message3 = socket_sub2.recv_string()
if message2 == 'HEARTBEAT':
print(message2)
print(message3)
if message2 == 'BAR_BB':
message3_split = message3.split(";")
message3_split = [e[3:] for e in message3_split]
#print(message3_split)
message3_split = message3_split
market_data_bb.append(message3_split)
if len(market_data_bb) > 20:
#df_bar_data_bb = pd.DataFrame(market_data_bb, columns= ['Ticker','Start_Time_Intervl','Interval_Length','Current_Open_Price',
# 'Previous_Open','Previous_Low','Previous_High','Previous_Close','Message_ID'])
#df_bar_data_bb.set_index('Start_Time_Intervl', inplace=True)
#ESA = df_bar_data_bb[df_bar_data_bb['Ticker'] == 'ESA Index'].copy()
#print(ESA)
#df_bar_data_bb.set_index('Start_Time_Intervl', inplace=True)
df_bar_data_bb.append(market_data_bb)
market_data_bb.clear()
print(df_bar_data_bb)
The very bottom is what throws the Error. I found a simple way around this that may or may not work. Its the 4 lines above that create a dataframe then set the index and try to create copies of the dataframe. The only problem is I get about anywhere from 40-90 messages a second and every time I get a new one it creates a new dataframe. I eventually have to create a graph out of this and im not exactly sure how I would create a live graph out of this. But thats another problem.
EDIT: I figured it out. Instead of adding the messages to a list I simply convert each message to a pandas series then call my dataframe globally then do df=df.append(message4,ignore_index=True)
I completely removed the need for lists
if message2 == 'BAR_BB':
message3_split = message3.split(";")
message3_split = [e[3:] for e in message3_split]
message4 = pd.Series(message3_split)
global df_bar_data_bb1
df_bar_data_bb1 = df_bar_data_bb1.append(message4, ignore_index = True)

Why am I returned an object when using std() in Pandas?

The print for average of the spreads come out grouped and calculated right. Why do I get this returned as the result for the std_deviation column instead of the standard deviation of the spread grouped by ticker?:
pandas.core.groupby.SeriesGroupBy object at 0x000000000484A588
df = pd.read_csv('C:\\Users\\William\\Desktop\\tickdata.csv',
dtype={'ticker': str, 'bidPrice': np.float64, 'askPrice': np.float64, 'afterHours': str},
usecols=['ticker', 'bidPrice', 'askPrice', 'afterHours'],
nrows=3000000
)
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
df['std_deviation'] = df['spread'].std(ddof=0)
df = df.groupby(['ticker'])
print(df['std_deviation'])
print(df['spread'].mean())
UPDATE: no longer being returned an object but now trying to figure out how to have the standard deviation displayed by ticker
df['spread'] = (df.askPrice - df.bidPrice)
df2 = df.groupby(['ticker'])
print(df2['spread'].mean())
df = df.set_index('ticker')
print(df['spread'].std(ddof=0))
UPDATE2: got the dataset I needed using
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
print(df.groupby(['ticker'])['spread'].mean())
print(df.groupby(['ticker'])['spread'].std(ddof=0))
This line:
df = df.groupby(['ticker'])
assigns df to a DataFrameGroupBy object, and
df['std_deviation']
is a SeriesGroupBy object (of the column).
It's a good idea not to "shadow" / re-assign one variable to a completely different datatype. Try to use a different variable name for the groupby!

Applying different functions to different columns of grouped dataframe

I am new to Pandas. I have grouped a dataframe by date and applied a function to different columns of the dataframe as shown below
def func(x):
questionID = x['questionID'].size()
is_true = x['is_bounty'].sum()
is_closed = x['is_closed'].sum()
flag = True
return pd.Series([questionID, is_true, is_closed, flag], index=['questionID', 'is_true', 'is_closed', 'flag'])
df_grouped = df1.groupby(['date'], as_index = False)
df_grouped = df_grouped.apply(func)
But when I run this I get an error saying
questionID = x['questionID'].size()
TypeError: 'int' object is not callable.
When I do the same thing this way it doesn't give any error.
df_grouped1 = df_grouped['questionID'].size()
I don't understand where am I going wrong.
'int' object is not callable. means you have to use size without ()
x['questionID'].size
For some objects size is only value, for others it can be function.
The same can be with other values/functions.