QPython Pandas Interaction

I have a question about a pandas DataFrame that I want to enrich with timings from a tick source (a kdb table).
Pandas DataFrame:
Date        sym     Level
2018-07-01  USDJPY  110
2018-08-01  GBPUSD  1.20
I want to enrich this DataFrame with timings: for each currency pair and date, the first time at which the level is crossed.
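import numpy as np
from qpython import qconnection
from qpython import MetaData
from qpython.qtype import QKEYED_TABLE
from qpython.qtype import QSTRING_LIST, QINT_LIST, QDATE_LIST, QDATETIME_LIST, QSYMBOL_LIST
# Connection details are assumed here; adjust host and port to your kdb+ instance.
q = qconnection.QConnection(host='localhost', port=5001, pandas=True)
q.open()
df.meta = MetaData(sym=QSYMBOL_LIST, val=QINT_LIST, Date=QDATE_LIST)
q('set', np.string_('tbl'), df)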
The code above converts the pandas DataFrame to a q table.
Example code to access the tick data (kdb tables):
select Mid by sym,date from quotestackevent where date = 2018.07.01, sym = `CCYPAIR
How can I use the DataFrame columns sym and Date to pull data from kdb tables using qPython?

Suppose on the KDB+ side you have a table t with columns sym (of type symbol), date (of type date), and mid (of type float), for example generated by the following code:
t:`date xasc ([] sym:raze (3#) each `USDJPY`GBPUSD`EURBTC;date:9#.z.d-til 3;mid:9?`float$10)
Then to bring the data for enrichment from the KDB+ side to the Python side you can do the following:
from qpython import qconnection
import pandas as pd
df = pd.DataFrame({'Date': ['2018-09-08','2018-09-08','2018-09-07','2018-09-07'],'sym':['abc','def','abc','def']})
df['Date']=df['Date'].astype('datetime64[ns]')
with qconnection.QConnection(host = 'localhost', port = 5001, pandas = True) as q:
    X = q.sync('{select sym,date,mid from t where date in `date$x}',df['Date'])
Here the first argument to q.sync() defines a function to be executed, and the second argument is the list of dates you want to get from the table t. Inside the function, the `date$x part converts the argument to a list of dates, which is needed because df['Date'] is sent to the KDB+ side as a list of timestamps.
The sym column of the resulting X data frame will contain byte strings, so you may want to do something like
X['sym'] = X['sym'].apply(lambda x: x.decode('ascii'))
to convert them to regular strings.
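As a small variation on the decoding step, pandas' vectorized Series.str.decode should do the same job without a lambda. The frame below is a hypothetical stand-in for the X returned by q.sync(), so the snippet runs without a KDB+ connection; the values in it are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by q.sync(); sym arrives as bytes.
X = pd.DataFrame({"sym": [b"USDJPY", b"GBPUSD"], "mid": [110.1, 1.21]})

# Decode every byte string in the column to a regular Python str.
X["sym"] = X["sym"].str.decode("ascii")
```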
An alternative to sending the function definition is to have a function defined on the KDB+ side and send only its name from the Python side. So, if you can do something like
getMids:{select sym,date,mid from t where date in `date$x}
on the KDB+ side, then you can do
X = q.sync('getMids',df['Date'])
instead of sending the function definition.
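Once the tick data is on the Python side, the enrichment asked about in the question can be sketched in pandas alone. This is only an illustration under stated assumptions: the ticks frame is hypothetical and carries a time column (which the example table t above does not have), the values are made up, and "crossed" is taken to mean the first tick whose mid is at or above the level.

```python
import pandas as pd

# Levels to enrich, as in the question.
levels = pd.DataFrame({
    "Date": pd.to_datetime(["2018-07-01", "2018-08-01"]),
    "sym": ["USDJPY", "GBPUSD"],
    "Level": [110.0, 1.20],
})

# Hypothetical ticks, standing in for data pulled from kdb+ (made-up values).
ticks = pd.DataFrame({
    "sym": ["USDJPY", "USDJPY", "USDJPY", "GBPUSD", "GBPUSD"],
    "time": pd.to_datetime([
        "2018-07-01 09:00:00", "2018-07-01 09:05:00", "2018-07-01 09:10:00",
        "2018-08-01 10:00:00", "2018-08-01 10:30:00",
    ]),
    "mid": [109.8, 110.1, 110.3, 1.19, 1.21],
})
ticks["Date"] = ticks["time"].dt.normalize()  # midnight of each tick's day

# Attach the level to every tick of the same date and pair.
merged = ticks.merge(levels, on=["Date", "sym"], how="inner")

# Assumption: a crossing is the first tick with mid >= Level.
crossed = merged[merged["mid"] >= merged["Level"]]
first_cross = (
    crossed.sort_values("time")
    .groupby(["Date", "sym"], as_index=False)["time"]
    .first()
    .rename(columns={"time": "first_cross_time"})
)

# Left merge keeps rows whose level was never crossed (time is NaT there).
enriched = levels.merge(first_cross, on=["Date", "sym"], how="left")
```

A downward crossing would need the comparison flipped, so the direction should be chosen per level in real use.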

Related

Extracting portions of the entries of Pandas dataframe

I have a pandas DataFrame with several columns in which each entry is a combination of digits, upper- and lower-case letters, and some special characters, i.e. "=A-Za-z0-9_|". Each entry is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the digits 0-9 that appear between the | | and strip out all other characters. Here is my code for one column of the DataFrame:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping anything. Is there an error in my code?

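Instead of working with hard-to-read regular expressions, you might want to consider a simple split. (Note that str.lstrip does not accept a regular expression: it treats its argument as a set of characters to remove, and 'x' is not among the characters listed, so nothing is stripped.) For example: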
import pandas as pd
d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
output = df["col"].str.split("|").str[1]
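If a regular expression is still preferred, a minimal sketch with Series.str.extract could look like the following. It assumes every entry contains exactly one run of digits enclosed by | characters, as in the sample; entries that do not match would come back as NaN.

```python
import pandas as pd

df = pd.DataFrame({"col": ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]})

# Capture one or more digits enclosed by a pair of | characters.
# expand=False returns a Series instead of a one-column DataFrame.
digits = df["col"].str.extract(r"\|(\d+)\|", expand=False)
```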

python - if-else in a for loop processing one column

I would like to loop through a column and convert it into a processed Series.
Below is an example data frame with two rows and four columns:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is more than 1; otherwise, use diagnosis_name_edited.
At the moment, I have the following code, which uses the diagnosis_name_edited column directly. How do I turn this into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str))  # generator of processed strings
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series)
Thank you.

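If I understand you correctly, try this fast solution using numpy.where:
import numpy as np
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
The condition > 1 follows the wording of the question; with the sample data, where the flag is 0 or 1, it never holds, so == 1 may be what is intended.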
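To connect this back to the processed Series in the question, the selection and the processing step might be combined as sketched below. Two assumptions are made: the flag is treated as 0/1, so the test is == 1 rather than the > 1 used above, and normalize is a hypothetical stand-in for rapid_utils.default_process so that the snippet runs without rapidfuzz installed.

```python
import numpy as np
import pandas as pd

df_diagnosis = pd.DataFrame({
    "diagnosis_name_edited": [
        " ac. nephritis. /. nephrotic syndrome",
        "sternocleidomastoid contracture",
    ],
    "is_spell_corrected": [1, 0],
    "spell_corrected_value": ["ac nephritis nephrotic syndrome", "NA"],
})

def normalize(text):
    # Hypothetical stand-in for rapid_utils.default_process: lower-case and trim.
    return str(text).lower().strip()

# Assumption: is_spell_corrected is a 0/1 flag, so compare with == 1.
chosen = np.where(
    df_diagnosis["is_spell_corrected"] == 1,
    df_diagnosis["spell_corrected_value"],
    df_diagnosis["diagnosis_name_edited"],
)

# Apply the processing step to whichever value was selected per row.
unmapped_processed_diagnosis = pd.Series(chosen).map(normalize)
```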

How to categorize a range of hours in Pandas?

In my project I am trying to create a new column that categorizes records by hour. I have a column in the DataFrame called 'TowedTime' containing times, and I want another column that holds the full hour without the minutes. For example, if the value in 'TowedTime' is 09:32:10 it should be categorized as 9 AM, if it is 12:45:10 it should be categorized as 12 PM, and so on for all other values. I've read about the cut function and bins, but I can't get the result I want.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
df = pd.read_excel("Baltimore Towing Division.xlsx",sheet_name="TowingData")
df['Month'] = pd.DatetimeIndex(df['TowedDate']).strftime("%b")
df['Week day'] = pd.DatetimeIndex(df['TowedDate']).strftime("%a")
monthOrder = ['Jan', 'Feb', 'Mar', 'Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dayOrder = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
pivotHours = pd.pivot_table(df, values='TowedDate', index='TowedTime',
                            columns='Week day',
                            fill_value=0,
                            aggfunc='count',
                            margins=False, margins_name='Total').reindex(dayOrder, axis=1)
print(pivotHours)

First, make sure the 'TowedTime' column has a datetime type. Then you can easily extract the hour from it.
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S')
df['hour'] = df['TowedTime'].dt.hour
Hope this answers your question.
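To get labels such as "9 AM" or "12 PM" as asked in the question, the extracted hour can be mapped to a 12-hour label. The sketch below uses a hypothetical helper, hour_label, and made-up sample times; it assumes 'TowedTime' holds strings in HH:MM:SS form.

```python
import pandas as pd

# Made-up sample times in the HH:MM:SS form described in the question.
df = pd.DataFrame({"TowedTime": ["09:32:10", "12:45:10", "00:05:00"]})

parsed = pd.to_datetime(df["TowedTime"], format="%H:%M:%S")
df["hour"] = parsed.dt.hour

def hour_label(hour):
    # Hypothetical helper: 0 -> "12 AM", 9 -> "9 AM", 12 -> "12 PM", 15 -> "3 PM".
    suffix = "AM" if hour < 12 else "PM"
    return f"{hour % 12 or 12} {suffix}"

df["hour_label"] = df["hour"].map(hour_label)
```

Building the label by hand avoids relying on the locale-dependent %p directive of strftime.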

With the help of @Fabien C I was able to solve the problem.
First, I checked the data type of the values in the 'TowedTime' column with dtypes and found that they were of type object.
I then tried converting 'TowedTime' to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.time
Then, to create a new column in the df holding only the hours:
df['Hour'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.hour
And the result was this:
You can notice in the image that the 'TowedTime' column remains an object, but the new 'Hour' column correctly returns the hour value.
Originally, the dataset already had the date and time separated into different columns. I think some method was used in Excel to separate them, which caused the time ('TowedTime') to be stored as an object. I could not convert it, or at least that is what dtypes shows me.
I tried all of these pandas methods for converting the object column to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = df['TowedTime'].astype('datetime64[ns]')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')

how can i get mean value of str type in a dataframe in Pandas

I have a pandas DataFrame.
I want to get the mean value of "stop_duration" for each "violation_raw".
How can I do this if the "stop_duration" column is of object type?
df = pd.read_csv('police.csv', parse_dates=['stop_date'])
df[['stop_date', 'violation_raw','stop_duration']]
My table:
the table

Use the to_datetime function to convert the object column to datetime, specifying a format that matches your data.
import pandas as pd
df["column"] = pd.to_datetime(df["column"], format="%M-%S Min")
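As an alternative sketch for the per-group mean itself: the question's table is not shown, so the snippet below assumes (from the format string above) that stop_duration holds range strings such as "0-15 Min". It converts each range to its midpoint in minutes and averages per violation_raw. The sample rows and the stop_minutes column are made up for illustration.

```python
import pandas as pd

# Made-up rows; the "N-M Min" shape of stop_duration is an assumption.
df = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Seat Belt", "Seat Belt"],
    "stop_duration": ["0-15 Min", "16-30 Min", "0-15 Min", "0-15 Min"],
})

# Pull the lower and upper bound of each range into two numeric columns.
bounds = df["stop_duration"].str.extract(r"(\d+)-(\d+)").astype(float)

# Use the midpoint of the range as a numeric duration in minutes.
df["stop_minutes"] = bounds.mean(axis=1)

mean_by_violation = df.groupby("violation_raw")["stop_minutes"].mean()
```

Values that do not follow the assumed range pattern would become NaN and be skipped by mean(), so they may need separate handling.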

Pandas - get count of each boolean field

I have other programs where I group and count fields. Now, I want to get a count of each boolean field. Is there a pandas way to do that rather than looping and writing my own code? Ideally, I would generate a new dataframe with the results (kind of like what I did here).
Easy Example CSV Data (data about poker hands generated):
Hand,Other1,Other2,IsFourOfAKind,IsThreeOfAKind,IsPair
1,'a','b',1,0,0
2,'c','d',0,1,0
3,'a','b',0,1,0
4,'x','y',0,0,1
5,'a','b',0,0,1
6,'a','b',0,0,1
7,'a','b',0,0,1
Program:
import pandas as pd
import warnings
filename = "./data/TestGroup2.csv"
# tell run time to ignore certain read_csv type errors (from pandas)
warnings.filterwarnings('ignore', message="^Columns.*")
count_cols = ['IsFourOfAKind','IsThreeOfAKind','IsPair']
# TODO - use the above to get counts of only these columns
df = pd.read_csv(filename)
print(df.head(10))
Desired Output - could just be a new dataframe
Column Count
IsFourOfAKind 1
IsThreeOfAKind 2
IsPair 3

Please try:
df.filter(like='Is').sum(0)
Or, if you need the result as a new dataframe:
df1=df.filter(like='Is').agg('sum').reset_index().rename(columns={'index':'column', 0:'count'})
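For completeness, here is a minimal sketch that restricts the sum to an explicit column list, as the count_cols variable in the question intends, and names the output columns as in the desired output. The CSV is read from an in-memory string so the snippet is self-contained. Note that with the sample rows as given, IsPair is set on four hands.

```python
import io
import pandas as pd

csv_text = """Hand,Other1,Other2,IsFourOfAKind,IsThreeOfAKind,IsPair
1,'a','b',1,0,0
2,'c','d',0,1,0
3,'a','b',0,1,0
4,'x','y',0,0,1
5,'a','b',0,0,1
6,'a','b',0,0,1
7,'a','b',0,0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

count_cols = ["IsFourOfAKind", "IsThreeOfAKind", "IsPair"]

# Summing 0/1 flags counts the rows where each flag is set.
counts = (
    df[count_cols]
    .sum()
    .rename_axis("Column")
    .reset_index(name="Count")
)
```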
