Pandas Resample Strange Zero Tolerance Behavior

I'm attempting to resample a time series in Pandas and I am getting some odd behavior:
print samples[196].base_df.to_string()
Units Sales
2008-07-03 3 820.00
2008-07-04 3 470.00
...
2010-06-22 1 335.00
2010-06-24 2 180.00
2010-06-30 -1 -2502.00
print samples[196].base_df.resample('15d', how='sum')
Units Sales
2008-07-03 17 3.149130e+03
2008-07-18 29 6.305210e+03
...
2010-06-08 18 5.204000e+03
2010-06-23 1 -2.322000e+03
2010-07-08 0 6.521324e-312
I would have expected the last value in the resampled series to be either zero or omitted. Is this expected behavior for the resample function? If helpful I can post the full time series, but it is a bit long...
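For what it's worth, here is a minimal sketch with a modern pandas (where the how= keyword has been removed), on a made-up frame rather than the asker's data: an empty 15-day bin now sums to 0, and min_count=1 turns empty bins into NaN so they can be dropped.
import pandas as pd

# made-up frame with a gap longer than 15 days, standing in for base_df
df = pd.DataFrame(
    {'Units': [1, 2, -1], 'Sales': [335.0, 180.0, -2502.0]},
    index=pd.to_datetime(['2010-06-01', '2010-06-02', '2010-07-10']))

print(df.resample('15D').sum())                               # empty bin shows 0
print(df.resample('15D').sum(min_count=1).dropna(how='all'))  # or drop empty bins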

Related

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stocks data.
I have a list ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
For every stock in the list I get an output which contains trades and related info. It looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in the BUY, SELL columns
# each dataframe has a different number of rows
Now I tried to add an extra level of index to this dataframe like this:
symbol = name of the stock from the given list, e.g. for 3MINDIA.NS.csv the symbol is 3MINDIA
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
After this I tried to concatenate each trades dataframe generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am trying to do this so that whenever I call result_df[symbol]
# I should be able to see the trade dates for that particular symbol.
But I get a result_df that has a lot of NaN values, because each trades dataframe has a variable number of rows in it.
Is there any way I can combine the trades dataframes along the columns, with the stock symbol as the higher index level, without getting all the NaN values in my result_df?
So I found a way to get what I wanted.
First I added this code in the loop:
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
After this I used concat again on result_df and trades:
# Desired Result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted
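For completeness, here is a minimal self-contained sketch of the whole loop under that approach; make_trades is a hypothetical stand-in for however each per-stock trades dataframe is actually produced, and only two file names from the list are used.
import pandas as pd

def make_trades(path):
    # hypothetical stand-in for the real per-stock trade computation
    return pd.DataFrame({'BUY': pd.to_datetime(['2004-01-13']),
                         'SELL': pd.to_datetime(['2004-01-27']),
                         'profits': [-44.2], 'rel_profits': [-0.0946]})

files = ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv']
pieces = []
for f in files:
    symbol = f.split('.')[0]                    # e.g. '3MINDIA'
    trades = make_trades(f)
    # add the symbol as an outer row-index level
    pieces.append(pd.concat([trades], keys=[symbol], names=['Stocks']))

result_df = pd.concat(pieces, axis=0)           # stack along rows, no NaN padding
print(result_df.loc['3MINDIA'])                 # trades for one symbol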

Pandas subtract dates to get a surgery patient length of stay

I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column.
Usually I use
df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days
I recently had to clean the data and must have done it in a different way to previously, because I am now getting a negative los, e.g.
DISDATE     ADMIDATE    los
2019-12-24  2019-12-08  -43805
2019-05-15  2019-03-26  50
2019-10-11  2019-10-07  4
2019-06-20  2019-06-16  4
2019-04-11  2019-04-08  3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):
5   ADMIDATE   5 non-null   datetime64[ns]
28  DISDATE    5 non-null   datetime64[ns]
I am not sure how to ask the right questions about this problem, or why it is only affecting some rows. In cleansing the data some of the DISDATEs had to be populated from another column (also a date column) because they were incomplete, and I wonder if it is these which come out negative, due to some retention of the original data somehow, even though printing the new DISDATE looks fine.
Your sample works well and gives the right output (16 days for the first row).
Can you try this and check whether the problem persists:
import io
data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
parse_dates=['DISDATE', 'ADMIDATE'])
print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)
Output:
0 16
1 50
2 4
3 4
4 3
dtype: int64
Update
To debug your bad dates, try:
df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']
You should see the rows whose values are not valid dates.
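Beyond locating the bad rows, one hedged follow-up (assuming the offending values can simply be re-parsed) is to coerce both columns and recompute the length of stay:
import pandas as pd

# anything that is not a valid date becomes NaT instead of corrupting the subtraction
df['ADMIDATE'] = pd.to_datetime(df['ADMIDATE'], errors='coerce')
df['DISDATE'] = pd.to_datetime(df['DISDATE'], errors='coerce')
df['los'] = (df['DISDATE'] - df['ADMIDATE']).dt.days   # NaN where either date is missing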

Datetime reformat weekly column

I split a dataframe from minute data into daily, weekly and monthly frames. I had no problem reformatting the daily dataframe, but I am having a good bit of trouble doing the same with the weekly one. Here is the weekly dataframe; if someone could help me out it would be great. I am also adding the code I used to reformat the daily dataframe, in case it helps!
I am plotting it with Bokeh and without the datetime format I won't be able to format the axis and hovertools as I would like.
Thanks beforehand.
dfDay1 = dfDay.loc['2014-01-01':'2020-09-30']
dfDay1 = dfDay1.reset_index()
dfDay1['date1'] = pd.to_datetime(dfDay1['date'], format=('%Y/%m/%d'))
dfDay1 = dfDay1.set_index('date')
That worked fine for the day format.
If you need the dates before the /, use Series.str.split with str[0]; if you need the dates after the /, use str[1]:
df['date1'] = pd.to_datetime(df['week'].str.split('/').str[0])
print (df)
week Open Low High Close Volume \
0 2014-01-07/2014-01-13 58.1500 55.38 58.96 56.0000 324133239
1 2014-01-14/2014-01-20 56.3500 55.96 58.57 56.2500 141255151
2 2014-01-21/2014-01-27 57.8786 51.85 59.31 52.8600 279370121
3 2014-01-28/2014-02-03 53.7700 52.75 63.95 62.4900 447186604
4 2014-02-04/2014-02-10 62.8900 60.45 64.90 63.9100 238316161
.. ... ... ... ... ... ...
347 2020-09-01/2020-09-07 297.4000 271.14 303.90 281.5962 98978386
348 2020-09-08/2020-09-14 275.0000 262.64 281.40 271.0100 109717114
349 2020-09-15/2020-09-21 272.6300 244.13 274.52 248.5800 123816172
350 2020-09-22/2020-09-28 254.3900 245.40 259.98 255.8800 98550687
351 2020-09-29/2020-10-05 258.2530 256.50 268.33 261.3500 81921670
date1
0 2014-01-07
1 2014-01-14
2 2014-01-21
3 2014-01-28
4 2014-02-04
.. ...
347 2020-09-01
348 2020-09-08
349 2020-09-15
350 2020-09-22
351 2020-09-29
[352 rows x 7 columns]
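If both week boundaries are needed later (e.g. for the Bokeh hover tools), here is a small sketch of the same split idea on a made-up frame; the week_start/week_end column names are my own:
import pandas as pd

df = pd.DataFrame({'week': ['2014-01-07/2014-01-13', '2014-01-14/2014-01-20']})
parts = df['week'].str.split('/', expand=True)
df['week_start'] = pd.to_datetime(parts[0])   # date before the slash
df['week_end'] = pd.to_datetime(parts[1])     # date after the slash
print(df)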

When plotting a dataframe, how to set the x-range for a 'YYYY-MM' value

I have a pandas df with the below values. I can create a nifty chart that looks like the following:
import matplotlib.pyplot as plt
ax = pdf_month.plot(x="month", y="count", kind="bar")
plt.show()
I want to truncate the date range (to ignore 1900-01 and other months that don't matter), but every time I try I get error messages (see below). The date range would be something like '2016-01' to '2018-04'.
ax.set_xlim(pdf_month['month'][17],pdf_date['count'].values.max())
where pdf_month['month'][17] gives you a value of u'2017-01'.
pdf_month.printSchema
root
|-- month: string (nullable = true)
|-- count: long (nullable = false)
How do I set the range on the month values for an x-value that isn't really an int or a date? I still have the original, pre-grouped dates. Is there a better way to group by month that would allow me to customize the x-axis?
error messages:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Sample output of pdf_month:
month count
0 1900-01 353
1 2015-09 1
2 2015-10 2
3 2015-11 2
4 2015-12 1
5 2016-01 1
6 2016-02 1
7 2016-03 3
8 2016-04 2
9 2016-05 5
10 2016-06 7
11 2016-07 13
12 2016-08 12
13 2016-09 41
14 2016-10 19
15 2016-11 17
16 2016-12 20
You can try Series date indexing; pandas Series allow date slicing as follows:
df.month['2016-01': '2018-04']
Note this works with datetime indexes.
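Since month here is a plain string column on a RangeIndex, a hedged sketch of making that slice work is to parse it and move it into the index first (toy data stands in for pdf_month):
import pandas as pd
import matplotlib.pyplot as plt

# toy stand-in for pdf_month
pdf_month = pd.DataFrame({'month': ['1900-01', '2016-01', '2016-02', '2018-05'],
                          'count': [353, 1, 1, 2]})
pdf_month['month'] = pd.to_datetime(pdf_month['month'], format='%Y-%m')

subset = pdf_month.set_index('month').loc['2016-01':'2018-04']
ax = subset.plot(y='count', kind='bar')
plt.show()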

Pandas add column based on grouped by rolling average

I have successfully added a new summed Volume column using Transform when grouping by Date like so:
df
Name Date Volume
--------------------------
APL 12-01-2017 1102
BSC 12-01-2017 4500
CDF 12-02-2017 5455
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
Name Date Volume vol_all_daily
------------------------------------------
APL 12-01-2017 1102 5602
BSC 12-01-2017 4500 5602
CDF 12-02-2017 5455 5455
However when I want to take the rolling average it doesn't work!
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
This returns a grouped result that raises an error on assignment and is too hard to put back into a df column anyway.
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).transform('mean').rolling(window=2).mean()
This just produces a result nearly identical to the vol_all_daily column.
Update:
I wasn't taking just one row per date. The above code still takes multiple rows per date, so instead I added .first() to the groupby. I'm not sure why groupby isn't taking one row per date.
The behavior of what you have written seems correct (Part 1 below), but perhaps you want to be calling something different (Part 2 below).
Part 1: Why what you have written is behaving correctly:
d = {'Name':['APL', 'BSC', 'CDF'],'Date':pd.DatetimeIndex(['2017-12-01', '2017-12-01', '2017-12-02']),'Volume':[1102,4500,5455]}
df = pd.DataFrame(d)
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
print(df)
rolling_vol = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
print('')
print(rolling_vol)
I get as output:
Date Name Volume vol_all_daily
0 2017-12-01 APL 1102 5602
1 2017-12-01 BSC 4500 5602
2 2017-12-02 CDF 5455 5455
Date
2017-12-01 0 NaN
1 5602.0
2017-12-02 2 NaN
Name: vol_all_daily, dtype: float64
To understand why this result rolling_vol is correct, notice that you called groupby first and rolling only afterwards, so the rolling window is applied within each date group; that result is not expected to align with df.
Part 2: What I think you wanted to call (just a rolling average):
If you instead run:
# same as above but without groupby
rolling_vol2 = df['vol_all_daily'].rolling(window=2).mean()
print('')
print(rolling_vol2)
You should get:
0 NaN
1 5602.0
2 5528.5
Name: vol_all_daily, dtype: float64
which looks more like the rolling average you seem to want. For more detail, I suggest reading up on the difference between pandas resampling and rolling.
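If the goal in the update is really "one row per date, then roll", a hedged sketch on the same toy df from Part 1 would be to collapse to one value per date first and only then take the rolling mean:
# one summed Volume per date, then a 2-period rolling mean across dates
daily = df.groupby('Date')['Volume'].sum()
daily_ma_2 = daily.rolling(window=2).mean()
print(daily_ma_2)

# optionally map the per-date rolling mean back onto the original rows
df['vol_all_ma_2'] = df['Date'].map(daily_ma_2)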