I want to interpolate between times in a pandas time series, and I would like to use scipy.interpolate.interp1d (or similar). The pandas interpolate method is undesirable, as it would require inserting NaN values and then replacing them via interpolation (thus modifying the dataset). What I tried:
from datetime import datetime
import scipy.interpolate as si
import pandas as pd
d1 = datetime(2019, 1, 1)
d2 = datetime(2019, 1, 5)
d3 = datetime(2019, 1, 10)
df = pd.DataFrame([1, 4, 2], index=[d1, d2, d3], columns=['conc'])
f = si.interp1d(df.index, df.conc)
f(datetime(2019, 1, 3))
All is good until the last line, where I get a ValueError: object arrays are not supported. Strangely enough, f.x nicely shows the dates as dtype='datetime64[ns]', so I was hoping it would work. Does anybody know how to get this to work?
This works, but is arguably a bit ugly:
f = si.interp1d(pd.to_numeric(df.index), df.conc)
f(pd.to_numeric(pd.to_datetime(['2019-1-3'])))
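If you want to hide the conversion, you can wrap it in a small helper. This is just a sketch of the same to_numeric trick; interp1d_datetime is an illustrative name, not an existing API:
import pandas as pd
import scipy.interpolate as si

def interp1d_datetime(series):
    # Interpolate over the numeric form of the index (ns since the epoch);
    # the returned function accepts list-likes of anything pd.to_datetime can parse.
    f = si.interp1d(pd.to_numeric(series.index), series.values)
    return lambda t: f(pd.to_numeric(pd.to_datetime(t)))

g = interp1d_datetime(df['conc'])
g(['2019-01-03'])  # array([2.5]), halfway between the values at Jan 1 and Jan 5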
The data I'm trying to process are replicates, like this:
(image: the raw data in long format, one row per replicate)
I want to plot them with error bars; can I plot them as is?
I was trying to reshape them to look like this, so I can plot them:
(image: the reshaped data, one column per treatment)
I tried to use a pivot table, but it seems to only work on data with at least two labels.
Thank you very much!
Try this (I checked, and it also works with a single label):
import pandas as pd
import numpy as np

t = ['treatment_1', 'treatment_1', 'treatment_1', 'treatment_2', 'treatment_2', 'treatment_2']
n = [10, 12, 13, 20, 22, 23]
df = pd.DataFrame(t, columns=['treatment'])
df['value'] = n
# Pivot to one column per treatment; rows that don't match become NaN,
# which we replace with 0 as a sentinel...
table = df.pivot(columns=['treatment']).replace(np.nan, 0)
# ...then compact each column by keeping only its non-sentinel values.
table = pd.DataFrame({c: table.loc[table[c] != 0, c].tolist() for c in table})
print(table)
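If the sentinel-0 trick worries you (it would silently drop genuine zero measurements), a possible alternative is to number the replicates within each treatment and pivot on that counter. A sketch using groupby.cumcount, not part of the answer above:
# Number replicates within each treatment, then pivot on that counter.
df['rep'] = df.groupby('treatment').cumcount()
table = df.pivot(index='rep', columns='treatment', values='value')
print(table)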
I have a DataFrame with two pandas Series as follows:
   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (it took around 27 seconds on a DataFrame with 1 million rows):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)

if not are_in_accepted_values.all():
    raise AssertionError("Not all values are in accepted_values")
I think if you create a DataFrame from the list column, you can compare with DataFrame.eq and test whether at least one value per row matches using DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"]).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a small optimisation of your second idea: using a bit more NumPy and a bit less pandas makes it faster (more than 3x in my tests with time.perf_counter()).
import numpy as np

values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
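For reference, here is a minimal timing sketch for that variant, with synthetic (hypothetical) data of one million rows; absolute numbers will of course vary by machine:
import time
import numpy as np
import pandas as pd

# Synthetic 1-million-row frame, for timing only.
n = 1_000_000
df = pd.DataFrame({
    "value": np.random.randint(0, 4, n),
    "accepted_values": [[0, 1, 2, 3]] * n,
})

start = time.perf_counter()
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
print(are_in_accepted_values, time.perf_counter() - start)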
Converting a DataFrame to an xarray.Dataset is possible in pandas.
I would like to do the same with dask.
Edit: raised on the dask issue tracker here
FYI, you can go from an xarray.Dataset to a dask DataFrame.
Pandas solution using .to_xarray():
import pandas as pd
import numpy as np
df = pd.DataFrame([('falcon', 'bird', 389.0, 2),
                   ('parrot', 'bird', 24.0, 2),
                   ('lion', 'mammal', 80.5, 4),
                   ('monkey', 'mammal', np.nan, 4)],
                  columns=['name', 'class', 'max_speed', 'num_legs'])
df.to_xarray()
<xarray.Dataset>
Dimensions:    (index: 4)
Coordinates:
  * index      (index) int64 0 1 2 3
Data variables:
    name       (index) object 'falcon' 'parrot' 'lion' 'monkey'
    class      (index) object 'bird' 'bird' 'mammal' 'mammal'
    max_speed  (index) float64 389.0 24.0 80.5 nan
    num_legs   (index) int64 2 2 4 4
Dask solution?
import dask.dataframe as dd
ddf = dd.from_pandas(df, 1)
?
I could look at a solution using xarray, but I think it only has .from_dataframe.
import xarray as xr
ds = xr.Dataset.from_dataframe(ddf.compute())
So this is possible, and I've made a PR here that achieves it: https://github.com/pydata/xarray/pull/4659
It provides two methods, Dataset.from_dask_dataframe and DataArray.from_dask_series.
The main reason it hasn't been merged yet is that we're trying to compute the chunk sizes with as few dask computations as possible.
There's some more context in these issues: https://github.com/pydata/xarray/issues/4650, https://github.com/pydata/xarray/issues/3929
I was looking for something similar and created this function (it is not perfect, but it works pretty well).
It also keeps all the dask data as dask arrays, which saves memory, etc.
import xarray as xr
import dask.dataframe as dd

def dask_2_xarray(ddf, indexname='index'):
    ds = xr.Dataset()
    ds[indexname] = ddf.index
    for key in ddf.columns:
        # compute_chunk_sizes() is needed because dask does not know
        # the partition lengths up front.
        ds[key] = (indexname, ddf[key].to_dask_array().compute_chunk_sizes())
    return ds

# use:
ds = dask_2_xarray(ddf)
Example:
path = LOCATION TO FILE
ddf_test = dd.read_hdf(path, key="/data*", sorted_index=True, mode='r')
ds = dask_2_xarray(ddf_test, indexname="time")
ds
Result: (image of the resulting xarray.Dataset, with the variables stored as dask arrays)
Most of the time is spent computing the chunk sizes, so if somebody knows a better way to do that, it would be faster.
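One approach that might help: compute the partition lengths once with map_partitions(len) and pass them to to_dask_array(lengths=...), instead of calling compute_chunk_sizes() separately for every column. A sketch (dask_2_xarray_fast is an illustrative name; it assumes dask's lengths parameter, which accepts a sequence of chunk sizes):
import xarray as xr

def dask_2_xarray_fast(ddf, indexname='index'):
    # One pass over the data to get every partition length...
    lengths = tuple(ddf.map_partitions(len).compute())
    ds = xr.Dataset()
    ds[indexname] = ddf.index
    for key in ddf.columns:
        # ...then reuse the known lengths for every column.
        ds[key] = (indexname, ddf[key].to_dask_array(lengths=lengths))
    return ds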
This method doesn't currently exist. If you think that it should exist then I encourage you to raise a github issue as a feature request. You might want to tag some Xarray people though.
import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-02'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (736066.7, 736469.3)
Now, if we change the last date:
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (184.8, 189.2)
The first example seems consistent with the matplotlib docs:
Matplotlib represents dates using floating point numbers specifying the number of days since 0001-01-01 UTC, plus 1
Why does the second example return something seemingly completely different? I'm using pandas version 0.22.0 and matplotlib version 2.2.2.
In the second example, if you look at the plot, matplotlib is labelling the axis with quarters rather than dates:
(image: the plot with x-axis tick labels given as quarters)
The dates in this case are exactly six months and therefore two quarters apart, which is presumably why you're seeing this behavior. While I can't find it in the docs, the numbers given by xlim in this case are consistent with being the number of quarters since the Unix Epoch (Jan. 1, 1970).
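You can sanity-check that reading with pandas Periods, whose quarterly ordinals count quarters since 1970Q1 (ordinal 0). A small sketch; the ordinals below are inferred from the reported xlim values:
import pandas as pd

print(pd.Period(ordinal=185, freq='Q'))  # 2016Q2, contains 2016-05-01
print(pd.Period(ordinal=189, freq='Q'))  # 2017Q2, contains 2017-05-01
# The data spans ordinals 185..189, matching the reported xlim of
# (184.8, 189.2) once the 0.2 margin on each side is accounted for.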
Pandas uses different units to represent dates and times on the axes, depending on the range of dates/times in use. This means that different locators are in use.
In the first case,
print(ax.xaxis.get_major_locator())
# Out: pandas.plotting._converter.PandasAutoDateLocator
In the second case,
print(ax.xaxis.get_major_locator())
# pandas.plotting._converter.TimeSeries_DateLocator
You may force pandas to always use the PandasAutoDateLocator via the x_compat argument:
df.plot(x_compat=True)
This ensures you always get the same datetime definition, consistent with the matplotlib.dates convention.
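As a quick check, the limits then round-trip through matplotlib.dates again. A sketch reusing the data frame from above (exact values depend on the margins pandas adds):
import matplotlib.dates as mdates

ax = data.plot(x_compat=True)
print([mdates.num2date(x) for x in ax.get_xlim()])
# datetimes around 2016-05-01 and 2017-05-01, not quarter ordinals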
The drawback is that this removes the nice quarterly ticking and replaces it with the standard date ticking.
On the other hand, it then allows you to use the very customizable matplotlib.dates tickers and formatters. For example, to get quarterly ticks/labels:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)
ax = data.plot(x_compat=True)
# Quarterly ticks
ax.xaxis.set_major_locator(mdates.MonthLocator((1,4,7,10)))
# Formatting:
def func(x, pos):
    q = (mdates.num2date(x).month - 1) // 3 + 1
    tx = "Q{}".format(q)
    if q == 1:
        tx += "\n{}".format(mdates.num2date(x).year)
    return tx
ax.xaxis.set_major_formatter(mticker.FuncFormatter(func))
plt.setp(ax.get_xticklabels(), rotation=0, ha="center")
plt.show()
I'm trying to get the xlimits of a plot as a python datetime object from a time series plot created with pandas. Using ax.get_xlim() returns the axis limits as a numpy.float64, and I can't figure out how to convert the numbers to a usable datetime.
import pandas
from matplotlib import dates
import matplotlib.pyplot as plt
from datetime import datetime
from numpy.random import randn
ts = pandas.Series(randn(10000), index=pandas.date_range('1/1/2000',
                                                         periods=10000, freq='H'))
ts.plot()
ax = plt.gca()
ax.set_xlim(datetime(2000,1,1))
d1, d2 = ax.get_xlim()
print "%s(%s) to %s(%s)" % (d1, type(d1), d2, type(d2))
print "Using matplotlib: %s" % dates.num2date(d1)
print "Using datetime: %s" % datetime.fromtimestamp(d1)
which returns:
262968.0 (<type 'numpy.float64'>) to 272967.0 (<type 'numpy.float64'>)
Using matplotlib: 0720-12-25 00:00:00+00:00
Using datetime: 1970-01-03 19:02:48
According to the pandas timeseries docs, pandas uses the numpy.datetime64 dtype. I'm using pandas version '0.9.0'.
I am using get_xlim() instead of directly accessing the pandas series because I am using the xlim_changed callback to do other things when the user moves around in the plot area.
Hack to get usable values
For the above example, the limits are returned in hours since the Epoch. So I can convert to seconds since the Epoch and use time.gmtime() to get somewhere usable, but this still doesn't feel right.
In [66]: d1, d2 = ax.get_xlim()
In [67]: time.gmtime(d1*60*60)
Out[67]: time.struct_time(tm_year=2000, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=1, tm_isdst=0)
The current behavior of matplotlib.dates:
datetime objects are converted to floating point numbers which represent time in days since 0001-01-01 UTC, plus 1. For example, 0001-01-01, 06:00 is 1.25, not 0.25. The helper functions date2num(), num2date() and drange() are used to facilitate easy conversion to and from datetime and numeric ranges.
pandas.tseries.converter.PandasAutoDateFormatter() seems to build on this, so:
x = pandas.date_range(start='01/01/2000', end='01/02/2000')
plt.plot(x, x)
matplotlib.dates.num2date(plt.gca().get_xlim()[0])
gives:
datetime.datetime(2000, 1, 1, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x7ff73a60f290>)
# First convert to pandas Period
period = pandas.tseries.period.Period(ordinal=int(d1), freq=ax.freq)
# Then convert to pandas timestamp
ts = period.to_timestamp()
# Then convert to date object
dt = ts.to_datetime()
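In recent pandas versions Timestamp.to_datetime() no longer exists; a sketch of the same conversion with current APIs (still assuming pandas has set the ax.freq attribute on the axes, as above):
import pandas as pd

d1, d2 = ax.get_xlim()
# Period(ordinal=...) interprets the axis number in the plot's own frequency,
# and to_pydatetime() yields a plain datetime.datetime.
start = pd.Period(ordinal=int(d1), freq=ax.freq).to_timestamp().to_pydatetime()
end = pd.Period(ordinal=int(d2), freq=ax.freq).to_timestamp().to_pydatetime()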