Return KDB query to a pandas dataframe

I would like to extract data from a KDB database and load it into a dataframe. My query runs fine in qPad with no issues; I just need to get the result into my pandas dataframe. My code:
from qpython import qconnection

# Create the connection and save the handle to a variable
q = qconnection.QConnection(host='wokplpaxvj003', port=11503, username='pelucas', password='Dive2600', timeout=3.0)
try:
    # initialize connection
    q.open()
    print(q)
    print('IPC version: %s. Is connected: %s' % (q.protocol_version, q.is_connected()))
    df = q.sendSync('{select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id}')
    df.info()
finally:
    q.close()
It fails on df.info(), raising AttributeError: 'QLambda' object has no attribute 'info', so I guess the call is not successful.

It looks like you've sent only a lambda, with no instruction to execute it. Two options:
Don't make it a lambda
df = q.sendSync('select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id')
Execute the lambda
df = q.sendSync('{select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id}[]')
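As a side note, qpython can also deserialize q tables straight into pandas DataFrames if the connection is opened with pandas=True, which avoids any manual conversion afterwards. A minimal sketch of that variant (host and port here are placeholders, query shortened to the non-lambda form):
```
from qpython import qconnection

# Sketch: pandas=True makes sendSync return pandas DataFrames directly
# instead of qpython's QTable objects. Host/port here are placeholders.
q = qconnection.QConnection(host='localhost', port=5000, pandas=True)
try:
    q.open()
    df = q.sendSync('select from quote_flat where date within (2019.08.14;2019.08.14), amendment_no = (max;amendment_no)fby quote_id')
    df.info()
finally:
    q.close()
```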

Related

KeyError: 'date' Pandas

```
if __name__ == "__main__":
    pd.options.display.float_format = '{:.4f}'.format
    temp1 = pd.read_csv('_4streams_alabama.csv.gz')
    temp1['date'] = pd.to_datetime(temp1['date'])

    def vacimpval(x):
        for date in x['date'].unique():
            if date >= '2022-06-16':
                x['vac_count'] = x['vac_count'].interpolate()
                x['vac_count'] = x['vac_count'].astype(int)

    for location in temp1['location_name'].unique():
        s = temp1.apply(vacimpval)
```
In the code above, I am trying to apply this function to every location so that I can fill in the values using the interpolate() method, but I don't know why I keep getting a KeyError.
Source of the error:
Since there are only two places in your code where you access 'date', and, as you said, temp1.columns contains 'date', the problem must be in x['date'].
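To see why, note that DataFrame.apply calls the function once per column by default, so inside vacimpval each x is a single column (a Series), and x['date'] then looks the label 'date' up in that column's row index, which raises the KeyError. A hedged sketch of the per-location pattern the question seems to be after, using groupby instead (column names taken from the question, cutoff logic kept as written):
```
import pandas as pd

# Sketch: interpolate vac_count within each location_name group for dates
# on or after 2022-06-16. groupby passes each group as a DataFrame, so
# x['date'] is a genuine column lookup here.
def vacimpval(x):
    mask = x['date'] >= '2022-06-16'
    x.loc[mask, 'vac_count'] = x.loc[mask, 'vac_count'].interpolate()
    return x

temp1 = temp1.groupby('location_name', group_keys=False).apply(vacimpval)
```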

concatenate results after multiprocessing

I have a function that creates a data frame by doing multiprocessing on a df. Suppose my df has 10 rows; the function processor will process all 10 rows separately. What I want is to concatenate all the outputs of processor into one data frame.
import sys
import multiprocessing
from multiprocessing import Pool

import numpy as np
import pandas as pd

def processor(dff):
    """
    reading data from a data frame and doing all sorts of data manipulation
    for multiprocessing
    """
    return dff

def main(infile, mdebug):
    global debug
    debug = mdebug
    try:
        lines = sum(1 for line in open(infile))
    except Exception as err:
        print("Error {} opening file: {}".format(err, infile))
        sys.exit(2000)
    if debug >= 2:
        print(infile)
    try:
        dff = pd.read_csv(infile)
    except Exception as err:
        print("Error {}, opening file: {}".format(err, infile))
        sys.exit(2000)
    df_split = np.array_split(dff, (lines + 1))
    cores = multiprocessing.cpu_count()
    cores = 64
    # pool = Pool(cores)
    pool = Pool(lines - 1)
    for n, frame in enumerate(pool.imap(processor, df_split), start=1):
        if frame is not None:
            frame.to_csv('{}'.format(n))
    pool.close()
    pool.join()

if __name__ == "__main__":
    args = parse_args()
    if args.debug >= 1:
        print("Running in debug mode:", args.debug)
    main(infile=args.infile, mdebug=args.debug)
You can use either the DataFrame constructor or concat to solve your problem; the appropriate one depends on details of your code that you haven't included.
Here's a more complete example:
import numpy as np
import pandas as pd
from multiprocessing import Pool

# create dummy dataset
dff = pd.DataFrame(np.random.rand(101, 5), columns=list('abcde'))

# process data
with Pool() as pool:
    result = pool.map(processor, np.array_split(dff, 7))

# put it all back together in one dataframe (pd.concat, not np.concat)
result = pd.concat(result)
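Applied back to the question's main(), the same idea means collecting the frames that pool.imap yields and concatenating once at the end, instead of writing one CSV per chunk. A sketch reusing the question's names:
```
import pandas as pd
from multiprocessing import Pool

# Sketch reusing names from the question: gather every non-empty chunk
# returned by processor, then build a single DataFrame with one concat.
with Pool(cores) as pool:
    frames = [f for f in pool.imap(processor, df_split) if f is not None]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv('combined.csv', index=False)
```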

Writing Data from pandas dataframe to PostgreSQL gives error of 'DataFrame' objects are mutable, thus they cannot be hashed

I am trying to save a data frame that was first imported into pandas from PostgreSQL as dfraw; after some manipulation I create another dataframe df and save it back to the same PostgreSQL database using SQLAlchemy. But when I try to save it back, it gives the error 'DataFrame' objects are mutable, thus they cannot be hashed.
Please find the code below:
import psycopg2
import pandas as pd
import numpy as np
import sqlalchemy
from sqlalchemy import create_engine
# connect the database to python
# Update connection string information
host = "something.something.azure.com"
dbname = "abcd"
user = "abcd"
password = "abcd"
sslmode = "require"
schema = 'xyz'
# Construct connection string
conn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(host, user, dbname, password, sslmode)
conn = psycopg2.connect(conn_string)
print("Connection established")
cursor = conn.cursor()
# Fetch all rows from table
cursor.execute("SELECT * FROM xyz.abc;")
rows = cursor.fetchall()
# Convert the tuples in dataframes
dfraw = pd.DataFrame(rows, columns =["ID","Timestamp","K","S","H 18","H 19","H 20","H 21","H 22","H 23","H 24","H 2zzz","H zzz4","H zzzzzz","H zzz6","H zzz7","H zzz8","H zzz9","H 60","H zzz0","H zzz2"])
dfraw[["S","H 18","H 19","H 20","H 21","H 22","H 23","H 24","H 2zzz","H zzz4","H zzzzzz","H zzz6","H zzz7","H zzz8","H zzz9","H 60","H zzz0","H zzz2"]] = dfraw[["S","H 18","H 19","H 20","H 21","H 22","H 23","H 24","H 2zzz","H zzz4","H zzzzzz","H zzz6","H zzz7","H zzz8","H zzz9","H 60","H zzz0","H zzz2"]].apply(pd.to_numeric)
dfraw[["Timestamp","K"]]=dfraw[["Timestamp","K"]].apply(pd.to_datetime)
# Creating temp files
temp1 = dfraw
dfraw = temp1
# creating some fucntions for data manipulation and imputations
def remZero(df, dropCol):
    for k in df.drop(dropCol, axis=1):
        if all(df[k] == 0):
            continue
        if any(df[k] == 0):
            print(k)
            df[k] = df[k].replace(to_replace=0, method='ffill')
    return df
# Drop Columns function
dropCol = ['Timestamp','K','ID','H','C','S']
dropCol2 = ['Timestamp','K','ID','Shift']
df = remZero(dfraw,dropCol)
engine = create_engine('postgresql://abcd:abcd@something.something.azure.com:5432/abcd')
df.to_sql(name=df,
          con=engine,
          index=False,
          if_exists='replace'
          )
Error message: 'DataFrame' objects are mutable, thus they cannot be hashed
I found a basic error in the code: I just missed putting inverted commas around the data frame name to be published. Basic hygiene was missed.
df.to_sql(name = "df",
con=engine,
index = False,
if_exists= 'replace'
)
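The name argument to to_sql must be the table name as a string, not the DataFrame object itself; passing the frame is what triggers the unhashable-DataFrame error, since pandas ends up trying to hash it while resolving the target table. For completeness, a hedged round-trip sketch (engine and the xyz schema as in the question):
```
import pandas as pd

# Sketch: write df to table "df" in schema xyz, then read a few rows back.
df.to_sql(name='df', con=engine, schema='xyz', index=False, if_exists='replace')
check = pd.read_sql('SELECT * FROM xyz."df" LIMIT 5;', con=engine)
print(check)
```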

Writing loop values within a function into Pandas dataframe

The function below is part of the Google Sheets quickstart.py, which lets people read a Google Sheet by URL.
I am able to run the test and get the print to work.
See the print statement in the function below:
print('%s, %s,%s,%s,%s,%s,%s,%s' % (row[0],row[1],row[2],row[3], row[4], row[5],row[6],row[7]))
My ultimate goal is to capture the data from that print in a pandas dataframe instead. None of my attempts worked.
import os.path
import pickle

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

def main():
    """Shows basic usage of the Sheets API.
    Prints values from a sample spreadsheet.
    """
    creds = None
    # The file token.pickle stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)

    service = build('sheets', 'v4', credentials=creds)

    # Call the Sheets API
    sheet = service.spreadsheets()
    result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
                                range=SAMPLE_RANGE_NAME).execute()
    values = result.get('values', [])

    if not values:
        print('No data found.')
    else:
        # print('Name, Major:')
        for row in values:
            # d = {'Case_Type': row.Case_Type,
            #      'Date': row.Date,
            #      'Cases': row.Cases,
            #      'Country_Region': row.Country_Region,
            #      'Lat': Lat,
            #      'Long': Long}
            # L.append(d)
            # df = pd.DataFrame(L)
            # Print columns A through H, which correspond to indices 0-7.
            print('%s, %s, %s, %s, %s, %s, %s, %s' % (row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7]))

if __name__ == '__main__':
    main()
Since values seems to be a 2D list, try doing
pd.DataFrame.from_records(values, columns=['Date', 'Cases', 'Country_Region', 'Lat', 'Long'])
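Dropped into the question's main(), that one line replaces the print loop. A sketch (whether the first sheet row is a header row, and the column names, are assumptions here):
```
import pandas as pd

# Sketch: values is the 2D list returned by the Sheets API. If the first
# row holds column headers, reuse it; otherwise pass explicit column names.
if values:
    df = pd.DataFrame.from_records(values[1:], columns=values[0])
    print(df.head())
```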

Problems appending rows to DataFrame. ZMQ messages to Pandas Dataframe

I am taking market-data messages from a ZMQ subscription and turning them into a pandas dataframe.
I tried creating an empty dataframe and appending rows to it, but it did not work out; I keep getting this error:
RuntimeWarning: '<' not supported between instances of 'str' and 'int', sort
order is undefined for incomparable objects
result = result.union(other)
I'm guessing this is because I'm appending a list of strings to the dataframe. I clear the list, then try to append the next row. Each message has 9 fields: the first is a string and the other 8 are all floats.
import zmq
import pandas as pd

list_heartbeat = []
list_fills = []
market_data_bb = []
market_data_fs = []
abacus_fs = []
abacus_bb = []

df_bar_data_bb = pd.DataFrame(columns=['Ticker', 'Start_Time_Intervl', 'Interval_Length', 'Current_Open_Price',
                                       'Previous_Open', 'Previous_Low', 'Previous_High', 'Previous_Close', 'Message_ID'])
def main():
    context = zmq.Context()
    socket_sub1 = context.socket(zmq.SUB)
    socket_sub2 = context.socket(zmq.SUB)
    socket_sub3 = context.socket(zmq.SUB)
    print('Opening Socket...')
    # We can connect to several endpoints if we desire, and receive from all.
    print('Connecting to Nicks BroadCast...')
    socket_sub1.connect("Server:port")
    socket_sub2.connect("Server:port")
    socket_sub3.connect("Server:port")
    print('Connected To Nicks BroadCast... Waiting For Messages.')
    print('Connected To Jasons Two BroadCasts... Waiting for Messages.')
    # socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'H')
    socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'R')
    # socket_sub1.setsockopt_string(zmq.SUBSCRIBE, 'HEARTBEAT')  # possible heartbeat from Jason
    socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'BAR_FS')
    socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'HEARTBEAT')
    socket_sub2.setsockopt_string(zmq.SUBSCRIBE, 'BAR_BB')
    socket_sub3.setsockopt_string(zmq.SUBSCRIBE, 'ABA_FS')
    socket_sub3.setsockopt_string(zmq.SUBSCRIBE, 'ABA_BB')

    poller = zmq.Poller()
    poller.register(socket_sub1, zmq.POLLIN)
    poller.register(socket_sub2, zmq.POLLIN)
    poller.register(socket_sub3, zmq.POLLIN)

    while running:
        try:
            socks = dict(poller.poll())
        except KeyboardInterrupt:
            break
        # Check if the message is in socks; if so, save to message1-3 for future use.
        # msg1 = heartbeat for Nicks server
        # msg2 = fills
        # msg3 = market data split between FS and BB
        if socket_sub1 in socks:
            message1 = socket_sub1.recv_string()
            list_heartbeat.append(message1.split())
        if socket_sub2 in socks:
            message2 = socket_sub2.recv_string()
            message3 = socket_sub2.recv_string()
            if message2 == 'HEARTBEAT':
                print(message2)
                print(message3)
            if message2 == 'BAR_BB':
                message3_split = message3.split(";")
                message3_split = [e[3:] for e in message3_split]
                # print(message3_split)
                market_data_bb.append(message3_split)
                if len(market_data_bb) > 20:
                    # df_bar_data_bb = pd.DataFrame(market_data_bb, columns=['Ticker', 'Start_Time_Intervl', 'Interval_Length', 'Current_Open_Price',
                    #                                                        'Previous_Open', 'Previous_Low', 'Previous_High', 'Previous_Close', 'Message_ID'])
                    # df_bar_data_bb.set_index('Start_Time_Intervl', inplace=True)
                    # ESA = df_bar_data_bb[df_bar_data_bb['Ticker'] == 'ESA Index'].copy()
                    # print(ESA)
                    df_bar_data_bb.append(market_data_bb)
                    market_data_bb.clear()
                    print(df_bar_data_bb)
The very last line is what throws the error. I found a simple workaround that may or may not hold up: the four commented-out lines above it create a dataframe, set the index, and take copies of it. The only problem is that I get anywhere from 40 to 90 messages a second, and each new message would create a new dataframe. I eventually have to build a live graph out of this, and I'm not exactly sure how, but that's another problem.
EDIT: I figured it out. Instead of adding the messages to a list, I simply convert each message to a pandas Series, declare my dataframe as global, and do df = df.append(message4, ignore_index=True). I completely removed the need for lists:
if message2 == 'BAR_BB':
    message3_split = message3.split(";")
    message3_split = [e[3:] for e in message3_split]
    message4 = pd.Series(message3_split)
    global df_bar_data_bb1
    df_bar_data_bb1 = df_bar_data_bb1.append(message4, ignore_index=True)
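One caveat on that fix: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and appending a Series of strings row by row copies the whole frame on every message. A sketch of a batched, concat-based equivalent (names reused from the question; the global frame is assumed to carry the column names from the question's df_bar_data_bb definition):
```
import pandas as pd

# Sketch: buffer parsed rows and concatenate in batches instead of calling
# DataFrame.append once per message (append was removed in pandas 2.0).
rows = []

def on_bar_bb(message3_split):
    global df_bar_data_bb1
    rows.append(message3_split)
    if len(rows) >= 20:
        batch = pd.DataFrame(rows, columns=df_bar_data_bb1.columns)
        df_bar_data_bb1 = pd.concat([df_bar_data_bb1, batch], ignore_index=True)
        rows.clear()
```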