I have two DataFrames: df, which is indexed by datetime, and df2, which has a Date column (a Series of dates).
Before resampling I can run:
>>> df[df2['Date'][0]]
and obtain all rows for the day df2['Date'][0], which is 2013-08-07 in this example. However, after resampling by day I can no longer obtain the row for that day:
>>> df.resample('D', how=np.max)[df2['Date'][0]]
KeyError: u'no item named 2013-08-07'
even though that day is in the dataset:
>>> df.resample('D', how=np.max).head()
Date       | Temp | etc.
-----------|------|-----
2013-08-07 | 26.1 |
2013-08-08 | 28.2 |
etc.
I am not sure whether this is a bug or by design (and, if by design, why), but you can do the following to get the desired result:
In [168]:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random(100), columns=['Temp'])
df1.index = pd.date_range('2013-08-07', periods=100, freq='5H')
df1.index.name = 'Date'
In [169]:
df2 = pd.DataFrame(pd.date_range('2013-08-07', periods=23, freq='D'), columns=['Date'])
In [170]:
# You can do this
df3 = df1.resample('D', how=np.max)
print(df3[df3.index == df2['Date'][0]])

              Temp
Date
2013-08-07  0.8128

[1 rows x 1 columns]
In [171]:
# This still raises the KeyError
df3[df2['Date'][0]]
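For what it's worth, a plain .loc lookup with the Timestamp also seems to work on the resampled frame (a small sketch reusing df2 and df3 from above):
df3.loc[df2['Date'][0]]   # returns the 2013-08-07 row as a Series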
I have a table df with columns "timestamp" and "Y". I want to add another column "MaxY" that contains the largest Y value up to 24 hours in the future. That is:
df.MaxY.iloc[i] = df[(df.timestamp > df.timestamp.iloc[i]) &
(df.timestamp < df.timestamp.iloc[i] + timedelta(hours=24))].Y.max()
Obviously, computing it like that is very slow. Is there a better way?
In the similar case of computing "SumY" I can do it with a cumsum() trick, but similar tricks don't seem to work here.
As requested, here is an example table (MaxY is the desired output; the input is only the first two columns).
| timestamp        | Y | MaxY |
|------------------|---|------|
| 2016-03-29 12:00 | 1 | 3    |  rows 2 and 3 fall within 24 hours, so MaxY = max(2, 3)
| 2016-03-29 13:00 | 2 | 4    |  rows 3 and 4 fall in the interval, so MaxY = max(3, 4)
| 2016-03-30 11:00 | 3 | 4    |  rows 4, 5 and 6 all fall in the interval, so MaxY = max(4, 3, 2)
| 2016-03-30 12:30 | 4 | 3    |  MaxY = max(3, 2)
| 2016-03-30 13:30 | 3 | 2    |  row 6 is the only row in the interval
| 2016-03-30 14:00 | 2 | NaN? |  there are no rows in the interval; any value will do
Here's a way with resample/rolling. I get a weird warning with pandas 0.18.0 and Python 3.5; I don't think it's a concern, but I'm not sure why it is generated.
This assumes the index is 'timestamp'; if not, precede the following with df = df.set_index('timestamp'):
>>> df2 = df.resample('30min').sort_index(ascending=False).fillna(np.nan)
>>> df2 = df2.rolling(48,min_periods=1).max()
>>> df.join(df2,rsuffix='2')
Y Y2
timestamp
2016-03-29 12:00:00 1 3.0
2016-03-29 13:00:00 2 4.0
2016-03-30 11:00:00 3 4.0
2016-03-30 12:30:00 4 4.0
2016-03-30 13:30:00 3 3.0
2016-03-30 14:00:00 2 2.0
On this tiny dataframe it seems to be about twice as fast, but you'd have to test it on a larger dataframe to get a reasonable idea of relative speed.
Hopefully this is somewhat self-explanatory. The reverse (descending) sort is necessary because rolling only allows a backward-looking or centered window, as far as I can tell.
Consider an apply() solution, which may run faster. The function returns the max of a time-conditional Series for each row.
import pandas as pd
from datetime import timedelta

def daymax(row):
    # max of Y over the rows after `row` and within the next 24 hours
    ser = df.Y[(df.timestamp > row) &
               (df.timestamp <= row + timedelta(hours=24))]
    return ser.max()

df['MaxY'] = df.timestamp.apply(daymax)
print(df)
# timestamp Y MaxY
#0 2016-03-29 12:00:00 1 3.0
#1 2016-03-29 13:00:00 2 4.0
#2 2016-03-30 11:00:00 3 4.0
#3 2016-03-30 12:30:00 4 3.0
#4 2016-03-30 13:30:00 3 2.0
#5 2016-03-30 14:00:00 2 NaN
What's wrong with
df['MaxY'] = df[::-1].Y.shift(-1).rolling('24H').max()
df[::-1] reverses the df (you want it "backwards") and shift(-1) takes care of the "in the future" (this assumes 'timestamp' is already the DatetimeIndex, since time-based rolling needs one).
I have a pandas dataframe
published           | sentiment
--------------------|----------
2022-01-31 10:00:00 |  0
2021-12-29 00:30:00 |  5
2021-12-20          | -5
Since some rows don't have hours, minutes, and seconds, I strip the time part:
df_dominant_topic2['published']=df_dominant_topic2['published'].astype(str).str.slice(0, 10)
df_dominant_topic2['published']=df_dominant_topic2['published'].str.slice(0, 10)
I get:
published  | sentiment
-----------|----------
2022-01-31 |  0
2021-12-29 |  5
2021-12-20 | -5
If I plot the data:
plt.pyplot.plot_date(df['published'], df['sentiment'])
I get this error:
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
But I don't know why, since it should be a string by now.
How can I plot it (ideally keeping the temporal order)? Thank you.
Try it like this:
import pandas as pd
from matplotlib import pyplot as plt

values = [('2022-01-31 10:00:00', 0), ('2021-12-29 00:30:00', 5), ('2021-12-20', -5)]
cols = ['published', 'sentiment']
df_dominant_topic2 = pd.DataFrame.from_records(values, columns=cols)
df_dominant_topic2['published'] = df_dominant_topic2['published'].astype(str).str.slice(0, 10)

# you may sort the data by date
df_dominant_topic2.sort_values(by='published', ascending=True, inplace=True)

plt.plot(df_dominant_topic2['published'], df_dominant_topic2['sentiment'])
plt.show()
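If you would rather keep true datetime values, so matplotlib spaces the points on a real time axis instead of treating the sliced strings as categories, a small variation on the same frame might look like this (it reuses the df_dominant_topic2 built above):
df_dominant_topic2['published'] = pd.to_datetime(df_dominant_topic2['published'])  # parse back to datetime64
df_dominant_topic2 = df_dominant_topic2.sort_values('published')
plt.plot(df_dominant_topic2['published'], df_dominant_topic2['sentiment'])
plt.show()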
Dataframe A ('df_a') contains location-split temperature values at re-sampled 5-minute intervals:
logtime_round | location | value
2017-05-01 06:05:00 | 0 | 17
2017-05-01 06:05:00 | 1 | 14.5
2017-05-01 06:05:00 | 2 | 14.5
etc...
Dataframe B ('df_b') contains temperature values (re-sampled from hourly to daily):
logtime_round | airtemp
2017-05-01 | 10.33333
2017-05-02 | 10.42083
etc...
I have manipulated df_b so that only rows with airtemp <= 15.5 are included (logtime_round has dtype datetime64[ns]), and I would now like to manipulate df_a so that a new dataframe contains only the days present in df_b (I'm only interested in locations and values when the outdoor air temperature was <= 15.5).
Is this possible?
My first plan was to join the two dataframes and then remove any rows with NaN airtemp to get my desired df. However, after the join the df_b airtemp only appears on the first row of each day (e.g. for 2017-05-01), with the rest NaN. So perhaps the daily df_b airtemp can be duplicated across all rows of the same day?
joindf = df_a.join(df_b)
Thanks!
Use merge_asof (assuming both frames have been sorted by time):
pd.merge_asof(df_a, df_b, on='logtime_round')
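Alternatively, a sketch of the day-key merge the question hints at: broadcast df_b's daily airtemp onto every 5-minute row by merging on the normalized date. This assumes logtime_round is a regular column in both frames, as in the merge_asof call above; the default inner merge then keeps only days present in the filtered df_b.
df_a['day'] = df_a['logtime_round'].dt.normalize()  # floor each 5-minute stamp to midnight
out = df_a.merge(df_b, left_on='day', right_on='logtime_round', suffixes=('', '_daily'))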
I'm attempting to create a new column that contains the Date input column as a datetime. I'd also happily accept changing the dtype of the Date column itself, but I'm not sure how to do that either.
I'm currently using dd.to_datetime to create the DateTime column. I'm importing from a CSV and letting dask decide on data types.
I'm fairly new to this, so I've tried a few stackoverflow answers, but I'm just fumbling and getting more errors than answers.
My input date string is, for example:
2019-20-09 04:00
This is what I currently have:
import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\State_Weathergrids.csv')
print(ddf.describe(include='all'))
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%y-%d-%m %H:%M')
The error I'm receiving is below. I'm assuming the last line is the most relevant piece, but for the life of me I cannot work out why the date format given doesn't match the format I'm specifying.
TypeError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py in _convert_listlike_datetimes(arg, box, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
290 try:
--> 291 values, tz = conversion.datetime_to_datetime64(arg)
292 return DatetimeIndex._simple_new(values, name=name, tz=tz)
pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
....
ValueError: time data '2019-20-09 04:00' does not match format '%y-%d-%m %H:%M' (match)
Current DataFrame properties using describe:
Dask DataFrame Structure:
Location Date Temperature RH
npartitions=1
float64 object float64 float64
... ... ... ...
Dask Name: describe, 971 tasks
Sample Data
+-----------+------------------+-------------+--------+
| Location | Date | Temperature | RH |
+-----------+------------------+-------------+--------+
| 1075 | 2019-20-09 04:00 | 6.8 | 99.3 |
| 1075 | 2019-20-09 05:00 | 6.4 | 100.0 |
| 1075 | 2019-20-09 06:00 | 6.7 | 99.3 |
| 1075 | 2019-20-09 07:00 | 8.6 | 95.4 |
| 1075 | 2019-20-09 08:00 | 12.2 | 76.0 |
+-----------+------------------+-------------+--------+
Try this:
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M', errors='ignore')
The format needs the four-digit-year directive %Y rather than %y, since your dates look like '2019-20-09'. With errors='ignore', values that fail to parse are returned unchanged; use errors='coerce' instead if you want NaT wherever to_datetime fails.
For more details, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
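As a quick sanity check of the corrected format string with plain pandas (a small sketch using one value from the sample data above):
import pandas as pd
pd.to_datetime('2019-20-09 04:00', format='%Y-%d-%m %H:%M')
# Timestamp('2019-09-20 04:00:00')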
Let's say I have the following table of customer data:
df = pd.DataFrame.from_dict({"Customer":[0,0,1],
"Date":['01.01.2016', '01.02.2016', '01.01.2016'],
"Type":["First Buy", "Second Buy", "First Buy"],
"Value":[10,20,10]})
which looks like this:
Customer | Date       | Type       | Value
---------|------------|------------|------
       0 | 01.01.2016 | First Buy  |    10
       0 | 01.02.2016 | Second Buy |    20
       1 | 01.01.2016 | First Buy  |    10
I want to pivot the table by the Type column.
However, pivoting only keeps the numeric Value column in the result.
I'd desire a structure like:
Customer | First Buy Date | First Buy Value | Second Buy Date | Second Buy Value
---------------------------------------------------------------------------------
where the missing values are NaN or NaT.
Is this possible using pivot_table? If not, I can imagine some workarounds, but they are quite lengthy. Any other suggestions?
Use unstack:
df1 = df.set_index(['Customer', 'Type']).unstack()
df1.columns = ['_'.join(cols) for cols in df1.columns]
print(df1)
Date_First Buy Date_Second Buy Value_First Buy Value_Second Buy
Customer
0 01.01.2016 01.02.2016 10.0 20.0
1 01.01.2016 None 10.0 NaN
If you need a different column order, use swaplevel and sort_index:
df1 = df.set_index(['Customer', 'Type']).unstack()
df1.columns = ['_'.join(cols) for cols in df1.columns.swaplevel(0,1)]
df1.sort_index(axis=1, inplace=True)
print(df1)
First Buy_Date First Buy_Value Second Buy_Date Second Buy_Value
Customer
0 01.01.2016 10.0 01.02.2016 20.0
1 01.01.2016 10.0 None NaN
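If you do want to stay with pivot_table, a sketch that should give the same shape (aggfunc='first' keeps the non-numeric Date column from being dropped):
df1 = df.pivot_table(index='Customer', columns='Type', values=['Date', 'Value'], aggfunc='first')
# flatten the (value, Type) MultiIndex columns the same way as above
df1.columns = ['_'.join(cols) for cols in df1.columns.swaplevel(0, 1)]
df1 = df1.sort_index(axis=1)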