pandas : reduction of timedelta64 using sum() results in int64? - numpy

According to the pandas 0.13.1 manual, you can reduce a numpy timedelta64 series:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-deltas-reductions
This seems to work fine with, for example, mean():
In[107]:
pd.Series(np.random.randint(0,100000,100).astype("timedelta64[ns]")).mean()
Out[107]:
0 00:00:00.000047
dtype: timedelta64[ns]
However, using sum(), this always results in an integer:
In [108]:
pd.Series(np.random.randint(0,100000,100).astype("timedelta64[ns]")).sum()
Out[108]:
5047226
Is this a bug, or is there e.g. overflow that is causing this? Is it safe to cast the result into timedelta64? How would I work around this?
I am using numpy 1.8.0.

Looks like a bug, just filed this: https://github.com/pydata/pandas/issues/6462
The results are in nanoseconds; as a work-around you can do this:
In [1]: s = pd.to_timedelta(range(4),unit='d')
In [2]: s
Out[2]:
0 0 days
1 1 days
2 2 days
3 3 days
dtype: timedelta64[ns]
In [3]: s.mean()
Out[3]:
0 1 days, 12:00:00
dtype: timedelta64[ns]
In [4]: s.sum()
Out[4]: 518400000000000
In [8]: pd.to_timedelta([s.sum()])
Out[8]:
0 6 days
dtype: timedelta64[ns]
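For a scalar result rather than a one-element Series, a minimal sketch of the same cast applied to the original example (assuming, as above, that sum() returns the total as integer nanoseconds on this pandas version):
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 100000, 100).astype("timedelta64[ns]"))
total_ns = s.sum()                            # integer nanoseconds on affected versions (later versions return a Timedelta)
total = pd.to_timedelta(total_ns, unit="ns")  # cast back to a timedelta
print(total)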

Related

Pandas.round() doesn't work on my dataset

Unfortunately pandas.round() doesn't work on my dataset
df.balance
Out[1]:
0 17173.71
1 17173.71
2 17173.71
Name: balance, dtype: float64
df.balance[0]
Out[2]: 17173.709999999999
df = df.round({'balance': 2})
df.balance
Out[4]:
0 17173.71
1 17173.71
2 17173.71
Name: balance, dtype: float64
df.balance[0]
Out[5]: 17173.709999999999
Python 2.7.10 and Pandas 0.19
Thanks
That's actually the most accurate representation of 17173.71 that a 64-bit float can hold:
01000000 11010000 11000101 01101101
01110000 10100011 11010111 00001010
which is 1.7173709999999999126885086298E4. You cannot represent 17173.71 exactly in binary floating point, so round() is working fine.
You might be confused about why you see 17173.71 when displaying the pandas Series but 17173.709999999999 when displaying the exact value. It is a result of pandas' formatting. Try:
pd.options.display.float_format = '{:.60f}'.format
Then try displaying the Series again.
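A quick way to see the exact value the float actually stores, using only the standard library (the Decimal constructor expands the stored binary value to its full decimal form):
from decimal import Decimal

Decimal(17173.71)
# Decimal('17173.709999999999126885086297...')  -- the exact stored double, not 17173.71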

How to show multiple timeseries plots using seaborn

I'm trying to generate 4 plots from a DataFrame using Seaborn
Date A B C D
2019-04-05 330.665 161.975 168.69 0
2019-04-06 322.782 150.243 172.539 0
2019-04-07 322.782 150.243 172.539 0
2019-04-08 295.918 127.801 168.117 0
2019-04-09 282.674 126.894 155.78 0
2019-04-10 293.818 133.413 160.405 0
I have cast dates using pd.to_datetime and numbers using pd.to_numeric. Here is the df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 460 to 465
Data columns (total 5 columns):
Date 6 non-null datetime64[ns]
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 6 non-null float64
dtypes: datetime64[ns](1), float64(4)
memory usage: 288.0 bytes
I can do a wide column plot by just calling .plot() on df.
However,
The legend of the plot is covering the plot itself
I would instead like to have 4 separate plots in 1 diagram and have tried using lmplot to achieve this.
I would like to add labels to the plot like so:
[example plot image from the original post]
I first melted the data:
df=pd.melt(df,id_vars='Date', var_name='Var', value_name='Unit')
And then tried lmplot
sns.lmplot(x = df['Date'], y='Unit', col='Var', data=df)
However, I get the traceback:
TypeError: Invalid comparison between dtype=datetime64[ns] and str
I have also tried df.set_index('Date') and replotting using x=df.index, and that gave me the same error.
The data can be plotted using Google Sheets but I am trying to automate a workflow where the chart can be generated and sent via Slack to selected recipients.
I hope I have expressed myself clearly enough as I am rather new to Python and Seaborn and hope to get some help from the experts here.
Regarding the legend, you can just use .legend(loc="upper left", bbox_to_anchor=(1,1)), as in this example:
%matplotlib inline
import pandas as pd
import numpy as np
data = np.random.rand(10,4)
df = pd.DataFrame(data, columns=["A", "B", "C", "D"])
df.plot()\
.legend(loc="upper left", bbox_to_anchor=(1,1));
As for the second request, IIUC you can start from:
df.plot(subplots=True, layout=(2,2));
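If you specifically want one seaborn facet per variable (what lmplot was attempting), a minimal sketch on the melted frame might look like this; note that seaborn expects column names (strings) for x and y rather than Series, which is what caused the TypeError (this assumes seaborn >= 0.9 for relplot):
import pandas as pd
import seaborn as sns

# melt to long form, then draw one line plot per variable
df_long = pd.melt(df, id_vars='Date', var_name='Var', value_name='Unit')
g = sns.relplot(data=df_long, x='Date', y='Unit', col='Var', col_wrap=2, kind='line')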

dask dataframes -time series partitions

I have a timeseries pandas dataframe that I want to partition by month and year. My thought was to get a list of datetimes that would serve as the index, but the break doesn't happen at 0:00 on the first of the month.
monthly_partitions = np.unique(df.index.values.astype('datetime64[M]')).tolist()
da = dd.from_pandas(df, npartitions=1)
How do I set the index to start at each month? I tried npartitions=len(monthly_partitions), but I realize that is wrong as it may not partition on the date at the start time. How should one ensure it partitions on the first date of the month?
UPDATE:
Using da = da.repartition(freq='1M') split the 10-minute data into 1-minute partitions; see below:
Dask DataFrame Structure:
Open High Low Close Vol OI VI
npartitions=5037050
2008-05-04 18:00:00 float64 float64 float64 float64 int64 int64 float64 int32
2008-05-04 18:01:00 ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ...
2017-12-01 16:49:00 ... ... ... ... ... ... ... ...
2017-12-01 16:50:00 ... ... ... ... ... ... ... ...
Dask Name: repartition-merge, 10074101 tasks
UPDATE 2:
Here is the code to reproduce the problem
import pandas as pd
import datetime as dt
import dask as dsk
import numpy as np
import dask.dataframe as dd
ts = pd.date_range("2015-01-01 00:00", "2015-05-01 23:50", freq="10min")
df = pd.DataFrame(np.random.randint(0,100,size=(len(ts),4)), columns=list('ABCD'), index=ts)
ddf=dd.from_pandas(df,npartitions=1)
ddf=ddf.repartition(freq='1M')
ddf
Assuming your dataframe is already indexed by time you should be able to use the repartition method to accomplish this.
df = df.repartition(freq='1M')
Edit after MCVE above
(thanks for adding the minimal and complete example!)
Interesting, this looks like a bug, either in pandas or dask. I assumed that '1M' would mean one month (as it does in pd.date_range):
In [12]: pd.date_range('2017-01-01', '2017-12-15', freq='1M')
Out[12]:
DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
'2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
'2017-09-30', '2017-10-31', '2017-11-30'],
dtype='datetime64[ns]', freq='M')
And yet, when passed to pd.Timedelta, it means one minute
In [13]: pd.Timedelta('1M')
Out[13]: Timedelta('0 days 00:01:00')
In [14]: pd.Timedelta('1m')
Out[14]: Timedelta('0 days 00:01:00')
So it's hanging because it's trying to make around 43200 times more partitions than you intended :)
We should file a bug report for this (do you have any interest in doing this?). A short term workaround would be to specify divisions yourself explicitly.
In [17]: divisions = pd.date_range('2015-01-01', '2015-05-01', freq='1M').tolist()
    ...: divisions[0] = ddf.divisions[0]
    ...: divisions[-1] = ddf.divisions[-1]
    ...: ddf.repartition(divisions=divisions)
    ...:
Out[17]:
Dask DataFrame Structure:
A B C D
npartitions=3
2015-01-01 00:00:00 int64 int64 int64 int64
2015-02-28 00:00:00 ... ... ... ...
2015-03-31 00:00:00 ... ... ... ...
2015-05-01 23:50:00 ... ... ... ...
Dask Name: repartition-merge, 7 tasks
If you would like to partition by the first day of each month then use the following:
ddf.repartition(freq='MS')
where MS means month start. Information on more DateOffset objects can be found in the pandas docs
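Applied to the reproduction above, a minimal sketch (same setup as the MCVE, just swapping in the 'MS' alias) would be:
import numpy as np
import pandas as pd
import dask.dataframe as dd

ts = pd.date_range("2015-01-01 00:00", "2015-05-01 23:50", freq="10min")
df = pd.DataFrame(np.random.randint(0, 100, size=(len(ts), 4)), columns=list('ABCD'), index=ts)
ddf = dd.from_pandas(df, npartitions=1)
ddf = ddf.repartition(freq='MS')   # one partition starting on the first of each month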

Equivalent of R's which in pandas

How do I get the column of the min in the example below, not the actual number?
In R I would do:
which(min(abs(_quantiles - mean(_quantiles))))
In pandas I tried (did not work):
_quantiles.which(min(abs(_quantiles - mean(_quantiles))))
You could do it this way: call np.min on the df values (as a NumPy array), use this to create a boolean mask, and drop the columns that don't have at least a single non-NaN value:
In [2]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df
Out[2]:
a b
0 -0.860548 -2.427571
1 0.136942 1.020901
2 -1.262078 -1.122940
3 -1.290127 -1.031050
4 1.227465 1.027870
In [15]:
df[df==np.min(df.values)].dropna(axis=1, thresh=1).columns
Out[15]:
Index(['b'], dtype='object')
idxmin and idxmax exist, but there is no general which as far as I can see. For your example you would want something like:
abs(_quantiles - _quantiles.mean()).idxmin()
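For example, with a small hypothetical series standing in for _quantiles:
import pandas as pd

_quantiles = pd.Series([0.1, 0.5, 0.9, 2.0], index=['q10', 'q50', 'q90', 'q99'])
# index label of the value closest to the mean -- analogous to R's which.min
closest_label = (_quantiles - _quantiles.mean()).abs().idxmin()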

detecting jumps on pandas index dates

I managed to load historical data series for a large set of financial instruments, indexed by date.
I am plotting volume and price information without any issue.
What I want to achieve now is to determine if there is any big jump in dates, to see if I am missing large chunks of data.
The idea I had in mind was to compute the difference between two consecutive dates in the index; if the number is greater than 3 or 4 days (which is bigger than a weekend plus a bank holiday on a Friday or Monday), then there is an issue.
The problem is that I can't figure out how to simply compute df[next day] - df[day], where df is indexed by day.
You can use the shift Series method (note the DatetimeIndex method shifts by freq):
In [11]: rng = pd.DatetimeIndex(['20120101', '20120102', '20120106']) # DatetimeIndex like df.index
In [12]: s = pd.Series(rng) # df.index instead of rng
In [13]: s - s.shift()
Out[13]:
0 NaT
1 1 days, 00:00:00
2 4 days, 00:00:00
dtype: timedelta64[ns]
In [14]: s - s.shift() > pd.offsets.Day(3).nanos
Out[14]:
0 False
1 False
2 True
dtype: bool
Depending on what you want, perhaps you could either do any, or find the problematic values...
In [15]: (s - s.shift() > pd.offsets.Day(3).nanos).any()
Out[15]: True
In [16]: s[s - s.shift() > pd.offsets.Day(3).nanos]
Out[16]:
2 2012-01-06 00:00:00
dtype: datetime64[ns]
Or perhaps find the maximum jump (and where it is):
In [17]: (s - s.shift()).max() # it's weird this returns a Series...
Out[17]:
0 4 days, 00:00:00
dtype: timedelta64[ns]
In [18]: (s - s.shift()).idxmax()
Out[18]: 2
If you really wanted to plot this, simply plotting the difference would work:
(s - s.shift()).plot()
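On the DataFrame itself, the same idea can be expressed against the index directly; a minimal sketch, assuming df has a DatetimeIndex:
import pandas as pd

gaps = df.index.to_series().diff()                    # timedeltas between consecutive index dates
problem_dates = gaps[gaps > pd.Timedelta(days=3)]     # dates preceded by a gap of more than 3 days
gaps.plot()                                           # or plot the gaps to eyeball missing chunks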