Python Pandas SettingWithCopyWarning while creating new column - python-3.8

I have a dataframe df as:
df.iloc[1:5,1:3]
Date Month
4 2013-01-03 00:00:00 1
6 2013-01-04 00:00:00 1
10 2013-01-07 00:00:00 1
12 2013-01-08 00:00:00 1
I am trying the following:
df['newCol'] = df['Month']*2
I get the following warning:
<input>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
What is the correct way to do the above?

In this case, it is safe to assign the value the way you did. However, if you want to avoid the warning and keep the good habit, you can do what the message says, i.e.:
df.loc[:, 'newCol'] = df['Month']*2
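For context: the warning almost always means that df itself was produced by slicing another DataFrame. A minimal sketch of that situation, assuming a hypothetical source frame big_df (not part of the original question):
df = big_df[big_df['Month'] == 1].copy()  # explicit copy: df now owns its data
df['newCol'] = df['Month'] * 2            # assigns without SettingWithCopyWarning
Taking the .copy() up front makes the intent unambiguous, so pandas no longer needs to warn that the write might not propagate back to big_df.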

Related

Pandas add row to datetime indexed dataframe

I cannot find a solution for this problem. I would like to add future dates to a datetime indexed Pandas dataframe for model prediction purposes.
Here is where I am right now:
new_datetime = df2.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
And this is where I am stuck. The append examples online only ever seem to show examples with ignore_index=True, and in my case I want to use the proper datetime indexing.
Suppose you have this df:
date value
0 2020-01-31 00:00:00 1
1 2020-02-01 00:00:00 2
2 2020-02-02 00:00:00 3
then an alternative for adding future days is
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='D', closed='right')}))
which returns
date value
0 2020-01-31 00:00:00 1.0
1 2020-02-01 00:00:00 2.0
2 2020-02-02 00:00:00 3.0
0 2020-02-03 00:00:00 NaN
1 2020-02-04 00:00:00 NaN
2 2020-02-05 00:00:00 NaN
3 2020-02-06 00:00:00 NaN
4 2020-02-07 00:00:00 NaN
where the frequency is D (days) and periods is 6; since closed='right' excludes the start date itself, five new rows are appended.
I think I was making this more difficult than necessary because I was using a datetime index instead of the typical integer index. By leaving the 'date' field as a regular column instead of an index, adding the rows is straightforward.
One thing I did do was add a reset_index call so I did not end up with wonky duplicate index values:
df = df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', closed='right')}))
df = df.reset_index(drop=True)  # drop=True rebuilds a clean integer index instead of keeping the old one as a column
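As a side note, here is a sketch of the same approach on current pandas, where DataFrame.append was removed (pandas 2.0) and date_range's closed= argument became inclusive= (pandas 1.4):
future = pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', inclusive='right')})
df = pd.concat([df, future]).reset_index(drop=True)  # pd.concat replaces the removed append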
I also needed this, and I solved it by merging the code you shared with the code from this other answer (add to a dataframe as I go with datetime index), ending up with the following code, which works for me.
data=raw.copy()
new_datetime = data.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
today_df = pd.DataFrame({'value': 301.124},index=new_datetime)
data = data.append(today_df)
data.tail()
Here 'value' is the column header of your own dataframe.
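For what it's worth, on pandas 2.x the same single-row append can be written with pd.concat; a sketch assuming the same raw frame and placeholder value as above:
data = raw.copy()
new_datetime = data.index[-1:] + pd.Timedelta('1 days')
today_df = pd.DataFrame({'value': 301.124}, index=new_datetime)
data = pd.concat([data, today_df])  # DataFrame.append was removed in pandas 2.0
data.tail()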

Time column interval filter

I have a dataframe with a "Fecha" column. I would like to reduce the DataFrame's size by filtering it, keeping only the rows whose timestamp is a multiple of 10 minutes and discarding all the rest.
Any ideas?
Thanks
I have to guess some variable names. But assuming your dataframe name is df, the solution should look similar to:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = df[df['Fecha'].dt.minute % 10 == 0]
The first line guarantees that your 'Fecha' column is in datetime format. The second line keeps only the rows whose minute is a multiple of 10, using the .dt accessor together with the modulo operator %.
Since I'm not sure if this solves your problem, here's a minimal example that runs by itself:
import pandas as pd
idx = pd.date_range(pd.Timestamp(2020, 1, 1), periods=60, freq='1T')
series = pd.Series(1, index=idx)
series = series[series.index.minute % 10 == 0]
series
The first three lines construct a series with a 1 minute index, which is filtered in the fourth line.
Output:
2020-01-01 00:00:00 1
2020-01-01 00:10:00 1
2020-01-01 00:20:00 1
2020-01-01 00:30:00 1
2020-01-01 00:40:00 1
2020-01-01 00:50:00 1
dtype: int64
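One caveat worth noting: minute % 10 == 0 also keeps timestamps such as 00:10:37, since it only inspects the minute field. If your 'Fecha' values can carry seconds, a stricter sketch is to keep only exact 10-minute marks by comparing each timestamp against itself floored to 10 minutes:
df = df[df['Fecha'].dt.floor('10min') == df['Fecha']]  # true only when seconds and smaller units are zero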

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob'; to do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final probably next week):
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g., like how we use Timestamps now for datetime64[ns]; coming in 0.12).
Not sure if you still need it, but in Pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that helps.
Let's say you have a pandas Series named time_difference whose dtype is timedelta64[ns].
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This apply is used because a raw numpy.timedelta64 object does not have a 'days' attribute. (In later pandas versions pd.tslib was removed; plain pd.Timedelta(x).days does the same job.)
To convert data in some other fixed-size unit into days, just use pd.Timedelta(...).days:
pd.Timedelta(1985, unit='W').days
13895
(Calendar units such as 'Y' and 'M' are ambiguous and are rejected by recent pandas versions.)
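On current pandas, the whole age calculation from the question collapses into the .dt accessor. A minimal sketch, assuming munged_data.entry_date and munged_data.dob are already datetime64 columns with no missing values (as in the question):
days = (munged_data['entry_date'] - munged_data['dob']).dt.days  # integer day counts
ages = (days / 365.25).round().astype(int)                       # approximate years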

Pandas add column based on grouped by rolling average

I have successfully added a new summed Volume column using Transform when grouping by Date like so:
df
Name Date Volume
--------------------------
APL 12-01-2017 1102
BSC 12-01-2017 4500
CDF 12-02-2017 5455
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
Name Date Volume vol_all_daily
------------------------------------------
APL 12-01-2017 1102 5602
BSC 12-01-2017 4500 5602
CDF 12-02-2017 5455 5455
However when I want to take the rolling average it doesn't work!
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
This returns a grouped, multi-indexed result that gives an error on assignment and is too hard to put back into a df column anyway.
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).transform('mean').rolling(window=2).mean()
This just produces a near-identical result to the vol_all_daily column.
Update:
I wasn't taking just one row per date; the above code will still take multiple dates. Instead I added .first() to the groupby. Not sure why groupby isn't taking one row per date.
The behavior of what you have written seems correct (Part 1 below), but perhaps you want to be calling something different (Part 2 below).
Part 1: Why what you have written is behaving correctly:
d = {'Name':['APL', 'BSC', 'CDF'],'Date':pd.DatetimeIndex(['2017-12-01', '2017-12-01', '2017-12-02']),'Volume':[1102,4500,5455]}
df = pd.DataFrame(d)
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
print(df)
rolling_vol = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
print('')
print(rolling_vol)
I get as output:
Date Name Volume vol_all_daily
0 2017-12-01 APL 1102 5602
1 2017-12-01 BSC 4500 5602
2 2017-12-02 CDF 5455 5455
Date
2017-12-01 0 NaN
1 5602.0
2017-12-02 2 NaN
Name: vol_all_daily, dtype: float64
To understand why this result rolling_vol is correct, notice that you first called groupby, and only after that called rolling. That should not be expected to produce something that aligns with df.
Part 2: What I think you wanted to call (just a rolling average):
If you instead run:
# same as above but without groupby
rolling_vol2 = df['vol_all_daily'].rolling(window=2).mean()
print('')
print(rolling_vol2)
You should get:
0 NaN
1 5602.0
2 5528.5
Name: vol_all_daily, dtype: float64
which looks more like the rolling average you seem to want. To explain that, I suggest reading the details of pandas resampling vs rolling.
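To make the asker's follow-up concrete, here is a sketch of the "one row per date" idea: collapse to one daily total via .first(), take the rolling mean across dates, and then broadcast the result back onto the original rows (variable names follow the example above):
daily = df.groupby('Date')['vol_all_daily'].first()  # one total per date
daily_ma2 = daily.rolling(window=2).mean()           # rolling mean across dates
df['vol_all_ma_2'] = df['Date'].map(daily_ma2)       # map the result back to every row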

pandas HDFStore select rows by datetime index

I'm sure this is probably very simple but I can't figure out how to slice a pandas HDFStore table by its datetime index to get a specific range of rows.
I have a table that looks like this:
mdstore = pd.HDFStore('store.h5')
histTable = '/ES_USD20120615_MIDPOINT30s'
print(mdstore[histTable])
open high low close volume WAP \
date
2011-12-04 23:00:00 1266.000 1266.000 1266.000 1266.000 -1 -1
2011-12-04 23:00:30 1266.000 1272.375 1240.625 1240.875 -1 -1
2011-12-04 23:01:00 1240.875 1242.250 1240.500 1242.125 -1 -1
...
[488000 rows x 7 columns]
For example I'd like to get the range from 2012-01-11 23:00:00 to 2012-01-12 22:30:00. If it were in a df I would just use datetimes to slice on the index, but I can't figure out how to do that directly from the store table so I don't have to load the whole thing into memory.
I tried mdstore.select(histTable, where='index>20120111') and that worked insofar as I got everything on the 11th and 12th, but I couldn't see how to add a time in.
An example follows; it needs pandas >= 0.13.0.
In [2]: df = DataFrame(np.random.randn(5),index=date_range('20130101 09:00:00',periods=5,freq='s'))
In [3]: df
Out[3]:
0
2013-01-01 09:00:00 -0.110577
2013-01-01 09:00:01 -0.420989
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
2013-01-01 09:00:04 -0.830469
[5 rows x 1 columns]
In [4]: df.to_hdf('test.h5','data',mode='w',format='table')
Specify it as a quoted string
In [8]: pd.read_hdf('test.h5','data',where='index>"20130101 09:00:01" & index<"20130101 09:00:04"')
Out[8]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]
You can also specify it directly as a Timestamp
In [10]: pd.read_hdf('test.h5','data',where='index>Timestamp("20130101 09:00:01") & index<Timestamp("20130101 09:00:04")')
Out[10]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]
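Applied to the store from the question, the quoted-string form would look something like this (a sketch using the mdstore and histTable names defined there):
mdstore.select(histTable, where='index>="20120111 23:00:00" & index<="20120112 22:30:00"')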