Dask Dataframe: Defining meta for date diff in groupby - pandas

I'm trying to find inter-purchase times (i.e., days between orders) for customers. Although my code is working correctly without defining meta, I would like to get it working properly and no longer see the warning asking me to provide meta.
Also, I would appreciate any suggestions on how to use map or map_partitions instead of apply.
So far I've tried:
meta={'days_since_last_order': 'datetime64[ns]'}
meta={'days_since_last_order': 'f8'}
meta={'ORDER_DATE_DT':'datetime64[ns]','days_since_last_order': 'datetime64[ns]'}
meta={'ORDER_DATE_DT':'f8','days_since_last_order': 'f8'}
meta=('days_since_last_order', 'f8')
meta=('days_since_last_order', 'datetime64[ns]')
Here is my code:
import numpy as np
import pandas as pd
import datetime as dt
import dask.dataframe as dd
from dask.distributed import wait, Client
client = Client(processes=True)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
d = (end - start).days + 1
np.random.seed(0)
df = pd.DataFrame()
df['CUSTOMER_ID'] = np.random.randint(1, 4, 10)
df['ORDER_DATE_DT'] = start + pd.to_timedelta(np.random.randint(1, d, 10), unit='d')
print(df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT']))
print(df)
ddf = dd.from_pandas(df, npartitions=2)
# setting ORDER_DATE_DT as index to sort by date
ddf = ddf.set_index('ORDER_DATE_DT')
ddf = client.persist(ddf)
wait(ddf)
ddf = ddf.reset_index()
grp = ddf.groupby('CUSTOMER_ID')[['ORDER_DATE_DT']].apply(
    lambda df: df.assign(days_since_last_order=df.ORDER_DATE_DT.diff(1))
    # meta=????
)
# for some reason, I'm unable to print grp unless I reset_index()
grp = grp.reset_index()
print(grp.compute())
Here is the printout of df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT'])
Here is the printout of grp.compute()
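One way to narrow down the meta is to run the same calculation in plain pandas and inspect the dtypes it produces. The sketch below uses a small made-up sample (the orders frame and its values are assumptions, not the data above). It shows that the difference between consecutive datetimes is a timedelta, not a datetime or a float, which may be why the meta attempts listed above were rejected. A mapping such as ORDER_DATE_DT to datetime64[ns] and days_since_last_order to timedelta64[ns] is therefore a plausible candidate, though it has not been tested against Dask here.

```python
import pandas as pd

# Hypothetical sample data, standing in for the frame in the question.
orders = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 2, 1, 2],
    "ORDER_DATE_DT": pd.to_datetime(
        ["2015-01-05", "2015-01-12", "2015-02-01", "2015-03-01", "2015-02-04"]
    ),
})

# Sort so that diff() compares each order with the customer's previous one.
orders = orders.sort_values(["CUSTOMER_ID", "ORDER_DATE_DT"])

# Per-customer gap between consecutive orders; the first order of each
# customer has no predecessor and becomes NaT.
gap = orders.groupby("CUSTOMER_ID")["ORDER_DATE_DT"].diff()
orders["days_since_last_order"] = gap

# The dtypes pandas actually produced, written as a name -> dtype mapping.
candidate_meta = {
    col: str(dtype)
    for col, dtype in orders[["ORDER_DATE_DT", "days_since_last_order"]].dtypes.items()
}
print(candidate_meta)

# If plain numbers are preferred, .dt.days turns the timedeltas into day counts.
gap_days = gap.dt.days
print(gap_days.tolist())
```

If the day counts are what you want to store, the column dtype in meta would then be a float ('f8') instead of a timedelta, because the missing first gap per customer is represented as NaN.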

Related

How to add avwap to pandas_ta?

I am trying to get anchored vwap from specific date using pandas_ta. How to set anchor to specific date?
import pandas as pd
import yfinance as yf
import pandas_ta as ta
from datetime import datetime, timedelta, date
import warnings
import plac
data = yf.download("aapl", start="2021-07-01", end="2022-08-01")
df = pd.DataFrame(data)
df1 = df.ta.vwap(anchor = "D")
df14 = pd.concat([df, df1],axis=1)
print(df14)
The anchor argument of pandas_ta's vwap depends on the index values, as the pandas-ta documentation says (reference):
anchor (str): How to anchor VWAP. Depending on the index values, it will
implement various Timeseries Offset Aliases as listed here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases
Default: "D".
In other words, you can't anchor to a specific date the way TradingView does.
To anchor to a date ourselves:
import pandas as pd
import numpy as np
import yfinance as yf
import pandas_ta as ta
# set anchor date
anchored_date = pd.to_datetime('2022-01-30')
data = yf.download("aapl", start="2022-01-01", end="2022-08-01")
df = pd.DataFrame(data)
df1 = df.ta.vwap(anchor = "D")
df14 = pd.concat([df, df1],axis=1)
# I create a column 'typical_price', it should be identical with 'VWAP_D'
df14['typical_price'] = (df14['High'] + df14['Low'] + df14['Close'])/3
tpp_d = ((df14['High'] + df14['Low'] + df14['Close'])*df14['Volume'])/3
df14['anchored_VWAP'] = tpp_d.where(df14.index >= anchored_date).groupby(df14.index >= anchored_date).cumsum()/df14['Volume'].where(df14.index >= anchored_date).groupby(df14.index >= anchored_date).cumsum()
df14
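The same idea can be checked without downloading anything. The following is a minimal sketch on a tiny made-up price table (the numbers and the helper name anchored_vwap are assumptions for illustration): rows before the anchor date are masked out, and the cumulative price-times-volume is divided by the cumulative volume from the anchor onward.

```python
import numpy as np
import pandas as pd

def anchored_vwap(prices: pd.DataFrame, anchor) -> pd.Series:
    """VWAP accumulated from `anchor` onward; NaN before the anchor date."""
    anchor = pd.Timestamp(anchor)
    on_or_after = prices.index >= anchor
    typical = (prices["High"] + prices["Low"] + prices["Close"]) / 3
    # Mask rows before the anchor so they do not enter the running sums.
    price_volume = (typical * prices["Volume"]).where(on_or_after)
    volume = prices["Volume"].where(on_or_after)
    return (price_volume.cumsum() / volume.cumsum()).rename("anchored_VWAP")

# Hypothetical daily bars.
bars = pd.DataFrame(
    {
        "High": [10.0, 11.0, 12.0, 13.0, 14.0],
        "Low": [8.0, 9.0, 10.0, 11.0, 12.0],
        "Close": [9.0, 10.0, 11.0, 12.0, 13.0],
        "Volume": [100, 200, 100, 300, 100],
    },
    index=pd.date_range("2022-01-27", periods=5, freq="D"),
)

vwap = anchored_vwap(bars, "2022-01-29")
print(vwap)
```

On the anchor date itself the result equals that day's typical price, which is a quick sanity check when applying this to real data.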

Python date comparison not working in .exe app

I have created a python script that works fine when running it in Spyder. I then freeze it with pyinstaller. When I run the .exe app, I get the following error.
Here is the relevant code:
import pandas as pd
import os
from datetime import datetime, time
import teradata as td
import numpy as np
import smtplib
import xlrd #needed for .exe
### Import Fleet Plan file ###
# raw strings keep the backslashes in the UNC path literal
path = r'\\PHX43XCIFSC0001\Planning'
folder = r'\Aircraft Availability'
file = r'\NP Fleet Plan.xlsx'
sheet = 'Mainline'
colnames = [0,2]
link = path + folder + file
update = pd.Timestamp.date(pd.Timestamp(datetime.fromtimestamp(
    os.path.getmtime(link)), unit='s'))
mydata = pd.read_excel(link, sheet_name = sheet, header=colnames, index=None)
df = mydata
# Flatten multiindex to single columns
df.columns = (['{}:{}'.format(i[0], i[1]) for i in df])
df = df.reset_index()
df = df.rename(columns={'index':'mDate', df.columns[1]:'DOW'})
# Remove blank columns and Fleet level columns
xcolunassigned = [col for col in df.columns if 'Unnamed' in col]
df = df.drop(xcolunassigned, axis=1)
xcolfleet = [col for col in df.columns if 'FLEET' in col]
df = df.drop(xcolfleet, axis=1)
# Transpose data in to vectors
dft = pd.melt(df, id_vars=['mDate', 'DOW'], var_name='Status', value_name='mCount')
# Split Subfleets, join Legacy, remove 0 and NaN
dft[['Status', 'SubFleet']] = dft.Status.str.split(':',expand=True)
sDate = min(dft.mDate)
dft = dft.dropna()
dft = dft.reset_index(drop=True)
dft = dft[dft['mCount'] != 0]
dft = dft.reset_index(drop=True)
# Delete all data prior to today
dft = dft[dft['mDate'] >= datetime.combine(datetime.today(), time.min) ]
dft = dft.reset_index(drop=True)
I am wondering if there is a dependency that I need to explicitly import like I had to for the xlrd library.
Thanks for the assistance.
There ended up being an issue with the .exe not loading some of the dependencies for the libraries I needed. After explicitly calling the dependencies in my code, the .exe application worked perfectly.
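The answer does not say which dependencies were missing. As a rough diagnostic, a frozen script could check at startup which of the modules it relies on can actually be located. The sketch below uses only the standard library; the helper name missing_modules and the module list are hypothetical and would need to be replaced with the modules your own libraries load.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that Python cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing

# Hypothetical list; the last entry is deliberately fake to show the output.
required = ["os", "datetime", "not_a_real_module_for_demo"]
print(missing_modules(required))
```

Any module reported here is a candidate for an explicit import in the script, which is the fix the answer describes.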

convert pandas datetime64[ns] to matplotlib date-float for date x-axis in seaborn tsplot

Ok I'm trying to do something that should be trivial but instead I've spent more time than I'd like to admit searching google and stack overflow only to become more frustrated.
What I'm trying to do: I'd like to format my x-axis on a seaborn tsplot.
What my Stack Overflow searching has told me: matplotlib has a set_major_formatter function, but I can't seem to use it without tripping an overflow error.
What I'm looking for: a simple way to convert datetime64[ns] to a float that can be used with matplotlib's set_major_formatter.
Where I think I'm stuck:
df.date_action = df.date_action.values.astype('float')
# converts the field to a float, but matplotlib expects days (not seconds) since its reference date, not nanoseconds since the epoch
Is there a simple way to do this that I'm missing?
The most helpful post I reviewed so far was 31255815, which got me 95% of the way there but not quite.
Here is some sample code to illustrate the issue:
# standard imports
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import seaborn as sns; sns.set()
## generate fake data
from datetime import timedelta, date
import random
def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)
start_date = date(2013, 1, 1)
end_date = date(2018, 6, 2)
date_list = []
number_list = []
for single_date in daterange(start_date, end_date):
    date_list.append(single_date)
    if len(number_list) > 0:
        number_list.append(random.random() + number_list[-1])
    else:
        number_list.append(random.random())
df = pd.DataFrame(data={'date_action': date_list, 'values': number_list})
# note my actual data comes in as a datetime64[ns]
df['date_action'] = df['date_action'].astype('datetime64[ns]')
# the following looked promising but is still offset an incorrect amount
#df.date_action = df.date_action.values.astype('float')
#df.date_action = df.date_action.to_datetime
## chart stuff
plt.clf()
df['dummy_01'] = 0
rows = 1
cols = 1
fig, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(10, 8))
ax1 = plt.subplot2grid((rows, cols), (0, 0))
for i in [ax1]: # trying to format x-axis
    i.xaxis_date()
    i.xaxis.set_major_locator(mdates.AutoDateLocator())
    i.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
sns.tsplot(df, time='date_action', unit='dummy_01',
           value='values', ax=ax1)
plt.plot()
plt.show()

How can I apply a couple of functions to multiple tickers in a list? (Code improvement)

So I'm currently learning how to analyse financial data in python using numpy, pandas, etc... and I'm starting off with a small script that will hopefully rank some chosen equities by the price change between 2 chosen dates.
My first script was:
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from pandas import Series, DataFrame
import datetime
from operator import itemgetter
#Edit below for 2 dates you wish to calculate:
start = datetime.datetime(2014, 7, 15)
end = datetime.datetime(2017, 7, 25)
stocks = ('AAPL', 'GOOGL', 'YHOO', 'MSFT', 'AMZN', 'DAI')
#Getting the data:
AAPL = web.DataReader('AAPL', 'google', start, end)
GOOGL = web.DataReader('GOOGL', 'google', start, end)
YHOO = web.DataReader('YHOO', 'google', start, end)
MSFT = web.DataReader('MSFT', 'google', start, end)
AMZN = web.DataReader('AMZN', 'google', start, end)
DAI = web.DataReader('DAI', 'google', start, end)
#Calculating the change:
AAPLkey = (AAPL.ix[start]['Close'])/(AAPL.ix[end]['Close'])
GOOGLkey = (GOOGL.ix[start]['Close'])/(GOOGL.ix[end]['Close'])
YHOOkey = (YHOO.ix[start]['Close'])/(YHOO.ix[end]['Close'])
MSFTkey = (MSFT.ix[start]['Close'])/(MSFT.ix[end]['Close'])
AMZNkey = (AMZN.ix[start]['Close'])/(AMZN.ix[end]['Close'])
DAIkey = (DAI.ix[start]['Close'])/(DAI.ix[end]['Close'])
#Formatting the output in a sorted order:
dict1 = {"AAPL" : AAPLkey, "GOOGL" : GOOGLkey, "YHOO" : YHOOkey, "MSFT" : MSFTkey, "AMZN" : AMZNkey, "DAI" : DAIkey}
out = sorted(dict1.items(), key=itemgetter(1), reverse = True)
for tick, change in out:
    print(tick, "\t", change)
I now obviously want to make this far shorter and this is what I've got so far:
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from pandas import Series, DataFrame
import datetime
from operator import itemgetter
#Edit below for 2 dates you wish to calculate:
start = datetime.datetime(2014, 7, 15)
end = datetime.datetime(2017, 7, 25)
stocks = ('AAPL', 'GOOGL', 'YHOO', 'MSFT', 'AMZN', 'DAI')
for eq in stocks:
    eq = web.DataReader(eq, 'google', start, end)
    for legend in eq:
        legend = (eq.ix[start]['Close'])/(eq.ix[end]['Close'])
print (legend)
The calculation works BUT the problem is this only outputs the last value for the item in the list (DAI).
So what's next in order to get the same result as my first code?
You can just move the print statement into the loop.
Like:
for legend in eq:
    legend = (eq.loc[start]['Close'])/(eq.loc[end]['Close'])
    print(legend)
Improved answer:
Get rid of the label loop and print the values from the previous loop:
for eq in stocks:
    df = web.DataReader(eq, 'google', start, end)
    print((df.loc[start]['Close'])/(df.loc[end]['Close']))
When you loop over stocks in the line for eq in stocks, you save the result into eq, so it gets overwritten on each iteration. You should store the results in a list, as I have done with data.
Then loop over the data list, which contains the dataframes, and select the values you need from each one.
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from pandas import Series, DataFrame
import datetime
from operator import itemgetter
# edit below for 2 dates you wish to calculate:
start = datetime.datetime(2014, 7, 15)
end = datetime.datetime(2017, 7, 25)
stocks = ('AAPL', 'GOOGL', 'YHOO', 'MSFT', 'AMZN', 'DAI')
# store all the dataframes in a list
data = []
for eq in stocks:
    data.append(web.DataReader(eq, 'google', start, end))
# print required fields from each dataframe
for df in data:
    print((df.loc[start]['Close'])/(df.loc[end]['Close']))
Output:
0.624067042032
0.612014075932
0.613225417599
0.572179539021
0.340850298595
1.28323537643
Thanks to the other answers, they both helped a lot. This is my final improved script thanks to that help:
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from pandas import Series, DataFrame
import datetime
from operator import itemgetter
# edit below for 2 dates you wish to calculate:
start = datetime.datetime(2014, 7, 15)
end = datetime.datetime(2017, 7, 25)
stocks = ('AAPL', 'GOOGL', 'YHOO', 'MSFT', 'AMZN', 'DAI')
dict1 = {}
for eq in stocks:
    df = web.DataReader(eq, 'google', start, end)
    k = ((df.loc[start]['Close'])/(df.loc[end]['Close']))
    dict1[eq] = k
out = sorted(dict1.items(), key=itemgetter(1), reverse=True)
for tick, change in out:
    print(tick, "\t", change)
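The ranking step can also be separated from the download, which makes it easy to test without a data source. This is a minimal sketch under that assumption: the tickers AAA and BBB and their prices are invented, and the helper names close_ratio and rank_by_ratio are hypothetical. In the script above, the frames dictionary would instead be filled from web.DataReader.

```python
from operator import itemgetter

import pandas as pd

def close_ratio(prices: pd.DataFrame, start, end) -> float:
    """Close on the start date divided by close on the end date."""
    return prices.loc[start, "Close"] / prices.loc[end, "Close"]

def rank_by_ratio(frames: dict, start, end) -> list:
    """Return (ticker, ratio) pairs sorted from highest to lowest ratio."""
    ratios = {ticker: close_ratio(df, start, end) for ticker, df in frames.items()}
    return sorted(ratios.items(), key=itemgetter(1), reverse=True)

start = pd.Timestamp(2014, 7, 15)
end = pd.Timestamp(2017, 7, 25)
index = pd.DatetimeIndex([start, end])

# Hypothetical price frames keyed by made-up ticker symbols.
frames = {
    "AAA": pd.DataFrame({"Close": [50.0, 100.0]}, index=index),
    "BBB": pd.DataFrame({"Close": [30.0, 20.0]}, index=index),
}

ranked = rank_by_ratio(frames, start, end)
for ticker, ratio in ranked:
    print(ticker, "\t", ratio)
```

Keeping the frames in a dictionary keyed by ticker avoids both the overwritten loop variable and the six separate variables of the first script.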

Formatting index of a pandas table in a plot

I am trying to annotate my plot with part of a dataframe. However, the time 00:00:00 is appearing in all the row labels. Is there a clean way to remove them since my data is daily in frequency? I have tried the normalize function but that doesn't remove the time; it just zeroes the time.
Here is what the issue looks like and the sample code to reproduce the issue.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import table
# Setup of mock data
date_range = pd.date_range('2014-01-01', '2015-01-01', freq='MS')
df = pd.DataFrame({'Values': np.random.rand(len(date_range))}, index=date_range)
# The plotting of the table
fig7 = plt.figure()
ax10 = plt.subplot2grid((1, 1), (0, 0))
table(ax10, np.round(df.tail(5), 2), loc='center', colWidths=[0.1] * 2)
fig7.show()
Simply use the .date attribute of the DatetimeIndex so that every element of your index is represented as a datetime.date.
Elements of a DatetimeIndex are timestamps (datetime.datetime-like), so they always carry a time component even if you never supplied one.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import table
np.random.seed(42)
# Setup of mock data
date_range = pd.date_range('2014-01-01', '2015-01-01', freq='MS')
df = pd.DataFrame({'Values': np.random.rand(len(date_range))}, date_range)
df.index = df.index.date # <------ only change here
# The plotting of the table
fig7 = plt.figure()
ax10 = plt.subplot2grid((1, 1), (0, 0))
table(ax10, np.round(df.tail(5), 2), loc='center', colWidths=[0.1] * 2)
fig7.show()
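The effect on the row labels can be seen without plotting. The sketch below compares two options on a small made-up frame: the .date attribute from the answer, and strftime, which is an alternative not mentioned above that gives control over the label text (for example, year and month only for monthly data).

```python
import datetime

import pandas as pd

# Hypothetical monthly data.
date_range = pd.date_range("2014-01-01", periods=3, freq="MS")
df = pd.DataFrame({"Values": [0.1, 0.2, 0.3]}, index=date_range)

# Option 1: datetime.date objects, which print without a time component.
as_dates = df.copy()
as_dates.index = df.index.date

# Option 2: preformatted strings, here year and month only.
as_text = df.copy()
as_text.index = df.index.strftime("%Y-%m")

print(as_dates)
print(as_text)
```

Both options replace the DatetimeIndex with a plain object index, so apply them to a copy used only for display if you still need date-based slicing on the original frame.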