I have the following df:
A B
2018-01-02 100.000000 100.000000
2018-01-03 100.808036 100.325886
2018-01-04 101.616560 102.307700
I would like to change the date format of the index, so I tried (following @jezrael's answer in the linked question Format pandas dataframe index date):
df.index = df.index.strftime('%d-%m-%Y')
But it raises:
AttributeError: 'Index' object has no attribute 'strftime'
My desired output would be:
A B
02-01-2018 100.000000 100.000000
03-01-2018 100.808036 100.325886
04-01-2018 101.616560 102.307700
The question in the link above seems quite similar to my issue, so I do not really understand why the AttributeError arises.
Your index seems to be of string (object) dtype, but strftime needs a DatetimeIndex (you can check with df.info()). Convert it first with pd.to_datetime:
In [19]: df.index = pd.to_datetime(df.index).strftime('%d-%m-%Y')
In [20]: df
Out[20]:
A B
02-01-2018 100.000000 100.000000
03-01-2018 100.808036 100.325886
04-01-2018 101.616560 102.307700
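A minimal sketch of the fix, assuming the index starts out as plain strings (the frame here is rebuilt from the question's sample values):

```python
import pandas as pd

# Rebuild the question's frame with a string index
df = pd.DataFrame(
    {"A": [100.000000, 100.808036, 101.616560],
     "B": [100.000000, 100.325886, 102.307700]},
    index=["2018-01-02", "2018-01-03", "2018-01-04"],
)

# A plain object Index has no .strftime; convert to datetime first
df.index = pd.to_datetime(df.index).strftime('%d-%m-%Y')
print(df.index.tolist())   # ['02-01-2018', '03-01-2018', '04-01-2018']
```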
I have a dataframe df as:
df.iloc[1:5,1:3]
Date Month
4 2013-01-03 00:00:00 1
6 2013-01-04 00:00:00 1
10 2013-01-07 00:00:00 1
12 2013-01-08 00:00:00 1
I am trying the following:
df['newCol'] = df['Month']*2
I get the following warning:
<input>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
What is the correct way to do the above?
In this case it is safe to assign the value the way you did. However, if you want to avoid the warning and keep the good habit, do what the message says, i.e.:
df.loc[:, 'newCol'] = df['Month']*2
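The warning usually appears because the frame was itself sliced from a larger one. A minimal sketch of that situation and the usual fix, an explicit .copy() (the data here is made up to mirror the question):

```python
import pandas as pd

full = pd.DataFrame({"Date": pd.to_datetime(["2013-01-03", "2013-01-04",
                                             "2013-01-07", "2013-01-08"]),
                     "Month": [1, 1, 1, 1]})

# Slicing without .copy() can leave df as a view of `full`; assigning to
# it then triggers SettingWithCopyWarning. An explicit copy is unambiguous.
df = full.iloc[1:3].copy()
df["newCol"] = df["Month"] * 2     # no warning on an explicit copy
print(df["newCol"].tolist())       # [2, 2]
```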
This is my partial df=
dStart y_test y_pred
2018-01-01 1 2
2018-01-01 2 2
2018-01-02 3 3
2018-01-02 1 2
2018-01-02 2 3
I want to create a column in another dataframe (df1) with the Mathews Correlation Coefficient of each unique dStart.
from sklearn.metrics import matthews_corrcoef
def mcc_func(y_test,y_pred):
return matthews_corrcoef(df[y_test].values,df[y_pred].values)
df1['mcc']=df.groupby('dStart').apply(mcc_func('y_test','y_pred'))
This function doesn't work -- I think because the function returns a float, and 'apply' wants to use the function on the groupby data itself, but I can't figure out how to give the right function to apply.
You need to apply the function within the grouped object:
g = df.groupby('dStart')
g.apply(lambda x: matthews_corrcoef(x['y_test'], x['y_pred']))
#OUTPUT
#dStart
#2018-01-01 0.0
#2018-01-02 0.0
#dtype: float64
Use apply with lambda function:
df = (df.groupby(['dStart']).apply(lambda x: matthews_corrcoef(x['y_test'], x['y_pred']))
.reset_index(name='Matthews_corrcoef'))
print(df)
dStart Matthews_corrcoef
0 2018-01-01 0.0
1 2018-01-02 0.0
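A self-contained version of that pattern, rebuilt from the question's sample data (this sketch assumes scikit-learn is installed; apply passes each group to the lambda as a sub-DataFrame, so the metric sees only that group's rows):

```python
import pandas as pd
from sklearn.metrics import matthews_corrcoef

df = pd.DataFrame({
    "dStart": ["2018-01-01", "2018-01-01",
               "2018-01-02", "2018-01-02", "2018-01-02"],
    "y_test": [1, 2, 3, 1, 2],
    "y_pred": [2, 2, 3, 2, 3],
})

# Each group arrives as a DataFrame, so index its columns inside the lambda
mcc = (df.groupby("dStart")
         .apply(lambda g: matthews_corrcoef(g["y_test"], g["y_pred"]))
         .rename("mcc"))
print(mcc)   # 0.0 for both dates, as in the answers above
```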
I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to calculate people's ages based on the time difference between 'entry_date' and 'dob'; to do this I need the difference in days between the two columns, so that I can then do something like round(days/365.25). I cannot seem to find a way to do this with a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (analogous to how we use Timestamps now for datetime64[ns]; coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494
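Pulling the thread together for current pandas versions: the subtraction yields a timedelta64[ns] column, and its .dt.days accessor gives integer days directly, so the age calculation is fully vectorized (the dates below are made up for illustration):

```python
import pandas as pd

munged_data = pd.DataFrame({
    "dob":        pd.to_datetime(["1980-06-01", "1995-01-15"]),
    "entry_date": pd.to_datetime(["2013-04-19", "2013-04-19"]),
})

# timedelta64 column -> integer days via the .dt accessor, then years
days = (munged_data["entry_date"] - munged_data["dob"]).dt.days
munged_data["age"] = (days / 365.25).round().astype(int)
print(munged_data["age"].tolist())   # [33, 18]
```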
I am attempting to calculate the rolling auto-correlation for a Series object using Pandas (0.23.3)
Setting up the example:
dt_index = pd.date_range('2018-01-01','2018-02-01', freq = 'B')
data = np.random.rand(len(dt_index))
s = pd.Series(data, index = dt_index)
Creating a Rolling object with window size = 5:
r = s.rolling(5)
Getting:
Rolling [window=5,center=False,axis=0]
Now when I try to calculate the correlation (Pretty sure this is the wrong approach):
r.corr(other=r)
I get only NaNs
I tried another approach based on the documentation:
df = pd.DataFrame()
df['a'] = s
df['b'] = s.shift(-1)
df.rolling(window=5).corr()
Getting something like:
...
2018-03-01 a NaN NaN
b NaN NaN
Really not sure where I'm going wrong with this; any help would be immensely appreciated! The docs use float64 as well. Could it be that the correlation is very close to zero, and that is why it shows NaN? Somebody raised a bug report here, but I think jreback solved the problem in a previous bug fix.
This is another relevant answer, but it's using pd.rolling_apply, which does not seem to be supported in Pandas version 0.23.3?
IIUC,
>>> s.rolling(5).apply(lambda x: x.autocorr(), raw=False)
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 -0.502455
2018-01-08 -0.072132
2018-01-09 -0.216756
2018-01-10 -0.090358
2018-01-11 -0.928272
2018-01-12 -0.754725
2018-01-15 -0.822256
2018-01-16 -0.941788
2018-01-17 -0.765803
2018-01-18 -0.680472
2018-01-19 -0.902443
2018-01-22 -0.796185
2018-01-23 -0.691141
2018-01-24 -0.427208
2018-01-25 0.176668
2018-01-26 0.016166
2018-01-29 -0.876047
2018-01-30 -0.905765
2018-01-31 -0.859755
2018-02-01 -0.795077
This is a lot faster than Pandas' autocorr but the results are different. In my dataset, there is a 0.87 Pearson correlation between the results of those two methods. There is a discussion about why the results are different here.
from statsmodels.tsa.stattools import acf
s.rolling(5).apply(lambda x: acf(x, unbiased=True, fft=False)[1], raw=True)
Note that the input cannot have null values, otherwise it will return all nulls.
I'm trying to accomplish the following...
I have a pandas DataFrame with a number of entries, indexed with a DatetimeIndex, which looks a bit like this:
bro_df.info()
<class 'bat.log_to_dataframe.LogToDataFrame'>
DatetimeIndex: 3596641 entries, 2017-12-14 13:52:01.633070 to 2018-01-03 09:59:53.108566
Data columns (total 20 columns):
conn_state object
duration timedelta64[ns]
history object
id.orig_h object
id.orig_p int64
id.resp_h object
id.resp_p int64
local_orig bool
local_resp bool
missed_bytes int64
orig_bytes int64
orig_ip_bytes int64
orig_pkts int64
proto object
resp_bytes int64
resp_ip_bytes int64
resp_pkts int64
service object
tunnel_parents object
uid object
dtypes: bool(2), int64(9), object(8), timedelta64[ns](1)
memory usage: 528.2+ MB
What I'm interested in is getting a slice of this data that takes the last entry, '2018-01-03 09:59:53.108566' in this case, and then subtracts an hour from that. This should give me the last hour's worth of entries.
What I've tried to do so far is the following:
last_entry = bro_df.index[-1:]
first_entry = last_entry - pd.Timedelta('1 hour')
This gives me what look like fairly correct values:
print(first_entry)
print(last_entry)
DatetimeIndex(['2018-01-03 08:59:53.108566'], dtype='datetime64[ns]', name='ts', freq=None)
DatetimeIndex(['2018-01-03 09:59:53.108566'], dtype='datetime64[ns]', name='ts', freq=None)
This is also sadly where I get stuck. I've tried various things with bro_df.loc and bro_df.iloc and so on, but all I get is different errors about datatypes, 'not in index', etc. This leads me to think I might need to convert the first_entry and last_entry variables to another type?
Or I might as usual be barking up entirely the wrong tree.
Any assistance or guidance would be most appreciated.
Cheers, Mike
It seems you need to create scalars by indexing with [0] and select with loc:
df = bro_df.loc[first_entry[0]: last_entry[0]]
Or select by exact indexing:
df = bro_df[first_entry[0]: last_entry[0]]
Sample:
rng = pd.date_range('2017-04-03', periods=10, freq='2H 24T')
bro_df = pd.DataFrame({'a': range(10)}, index=rng)
print (bro_df)
a
2017-04-03 00:00:00 0
2017-04-03 02:24:00 1
2017-04-03 04:48:00 2
2017-04-03 07:12:00 3
2017-04-03 09:36:00 4
2017-04-03 12:00:00 5
2017-04-03 14:24:00 6
2017-04-03 16:48:00 7
2017-04-03 19:12:00 8
2017-04-03 21:36:00 9
last_entry = bro_df.index[-1:]
first_entry = last_entry - pd.Timedelta('3 hour')
print (last_entry)
DatetimeIndex(['2017-04-03 21:36:00'], dtype='datetime64[ns]', freq='144T')
print (first_entry)
DatetimeIndex(['2017-04-03 18:36:00'], dtype='datetime64[ns]', freq=None)
print (last_entry[0])
2017-04-03 21:36:00
print (first_entry[0])
2017-04-03 18:36:00
df = bro_df.loc[first_entry[0]: last_entry[0]]
print (df)
a
2017-04-03 19:12:00 8
2017-04-03 21:36:00 9
df1 = bro_df[first_entry[0]: last_entry[0]]
print (df1)
a
2017-04-03 19:12:00 8
2017-04-03 21:36:00 9
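As a variant of the same idea, you can skip the one-element DatetimeIndex entirely by grabbing the last label as a scalar with .index[-1]; this sketch reuses the sample frame above (with the frequency spelled as '144min'):

```python
import pandas as pd

rng = pd.date_range("2017-04-03", periods=10, freq="144min")
bro_df = pd.DataFrame({"a": range(10)}, index=rng)

# .index[-1] is already a Timestamp scalar, so no [0] indexing is needed
last_entry = bro_df.index[-1]
df = bro_df.loc[last_entry - pd.Timedelta("3 hours"):last_entry]
print(df["a"].tolist())   # [8, 9]
```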