Pandas DatetimeIndex to dataframe - pandas

How do you change a DatetimeIndex into a simple dataframe like this:
month
0 2013-07-31
1 2013-08-31
2 2013-09-30
3 2013-10-31
This is the DatetimeIndex:
DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M')
Thank you.

The code below should work:
# Import libraries
import pandas as pd
import numpy as np
import datetime as dt
# Create dataframe
df = pd.DataFrame(pd.DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M'), columns=['month'])
df.head(2)

Use the DataFrame constructor:
idx = pd.DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
'2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28',
'2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30'],
dtype='datetime64[ns]', freq='M')
df = pd.DataFrame({'month':idx})
#alternative
#df = pd.DataFrame({'month':df1.index})
print (df)
month
0 2013-07-31
1 2013-08-31
2 2013-09-30
3 2013-10-31
4 2013-11-30
5 2013-12-31
6 2014-01-31
7 2014-02-28
8 2014-03-31
9 2014-04-30
10 2014-05-31
11 2014-06-30
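
An aside not covered in the answers above: on reasonably recent pandas (Index.to_frame gained its name parameter around 0.24), the conversion is a one-liner. A minimal sketch:

```python
import pandas as pd

idx = pd.DatetimeIndex(['2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31'])

# index=False replaces the dates with an ordinary 0..n-1 RangeIndex,
# name='month' sets the column label
df = idx.to_frame(index=False, name='month')
print(df)
```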

Related

Python DataFrame: How to write and read multiple tickers time-series dataframe?

This is a fairly complicated dataframe from a simple download. After saving it to a file (to_csv), I can't seem to read it back (read_csv) into the same dataframe. Please help.
import yfinance as yf
import pandas as pd
tickers=['AAPL', 'MSFT']
header = ['Open', 'High', 'Low', 'Close', 'Adj Close']
df = yf.download(tickers, period='1y')[header]
df.to_csv("data.csv", index=True)
dfr = pd.read_csv("data.csv")
dfr = dfr.set_index('Date')
print(dfr)
KeyError: "None of ['Date'] are in the columns"
Note:
df: Date is the Index
Open High
AAPL MSFT AAPL MSFT
Date
2022-02-07 172.86 306.17 173.95 307.84
2022-02-08 171.73 301.25 175.35 305.56
2022-02-09 176.05 309.87 176.65 311.93
2022-02-10 174.14 304.04 175.48 309.12
2022-02-11 172.33 303.19 173.08 304.29
But dfr (after read_csv)
Unnamed: 0 Open ... High High.1
0 NaN AAPL ... AAPL MSFT
1 Date NaN ... NaN NaN
2 2022-02-07 172.86 ... 173.94 307.83
3 2022-02-08 171.72 ... 175.35 305.55
4 2022-02-09 176.05 ... 176.64 311.92
How can I make dfr like df?
When I run the code, I get the error:
KeyError: "None of ['Date'] are in the columns"
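
The root cause is that yf.download with several tickers returns MultiIndex columns, so to_csv writes two header rows plus an index-name row, and a plain read_csv mangles them. A sketch of the usual round-trip fix, using made-up numbers in place of a real yfinance download: pass header=[0, 1] and index_col=0 to read_csv, then parse the index back to datetimes.

```python
import io
import pandas as pd

# Mimic the shape of yf.download(['AAPL', 'MSFT'], ...): a DatetimeIndex
# named 'Date' and two-level (field, ticker) columns.
cols = pd.MultiIndex.from_product([['Open', 'High'], ['AAPL', 'MSFT']])
df = pd.DataFrame(
    [[172.86, 306.17, 173.95, 307.84],
     [171.73, 301.25, 175.35, 305.56]],
    index=pd.to_datetime(['2022-02-07', '2022-02-08']).rename('Date'),
    columns=cols,
)

buf = io.StringIO()
df.to_csv(buf)          # same as df.to_csv("data.csv")
buf.seek(0)

# header=[0, 1] rebuilds the (field, ticker) MultiIndex columns,
# index_col=0 restores 'Date' as the index
dfr = pd.read_csv(buf, header=[0, 1], index_col=0)
dfr.index = pd.to_datetime(dfr.index)   # the index comes back as strings
print(dfr)
```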

Count how many non-zero entries at each month in a dataframe column

I have a dataframe, df, with a DatetimeIndex and a single column, like this:
I need to count how many non-zero entries I have in each month. For example, according to those images, in January I would have 2 entries, in February 1 entry, and in March 2 entries. I have more months in the dataframe, but I guess that explains the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me the total number of days in each month, and I didn't see any other parameter of count() that I could use here.
Any ideas?
Try index.to_period()
For example:
In [1]: import pandas as pd
import numpy as np
x_df = pd.DataFrame(
{
'values': np.random.randint(low=0, high=2, size=(120,))
} ,
index = pd.date_range("2022-01-01", periods=120, freq="D")
)
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17
You can also try this:
# replace zeros with NaN so dropna removes them, then count per month
import numpy as np
dfx['col1'] = dfx['col1'].replace(0, np.nan)
dfx = dfx.dropna()
dfx = dfx.resample('1M').count()
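
A version-agnostic sketch of the same counting idea (s is a placeholder Series with a DatetimeIndex): compare against zero and sum the boolean mask per month period, which avoids the replace/dropna round-trip:

```python
import pandas as pd

# toy series: 10 consecutive days spanning two months, some zeros
s = pd.Series([0, 1, 2, 0, 3, 0, 0, 4, 5, 0],
              index=pd.date_range('2022-01-28', periods=10, freq='D'))

# (s != 0) is True where non-zero; summing booleans counts them per month
counts = (s != 0).groupby(s.index.to_period('M')).sum()
print(counts)
```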

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to calculate people's ages based on the time difference between 'entry_date' and 'dob', and to do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I cannot seem to find a vectorized way to do this. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need pandas 0.11 for this (0.11rc1 is out; the final release is probably out next week):
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.iloc[0] - df.iloc[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.iloc[0] - df.iloc[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Suppose you have a pandas Series named time_difference with dtype timedelta64[ns].
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any timedelta into days, just use pd.Timedelta(...).days (note that recent pandas versions no longer accept ambiguous calendar units such as 'Y' or 'M' here):
pd.Timedelta(1985, unit='h').days
82
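
Tying this back to the original question: with .dt.days the whole age calculation becomes one vectorized expression. A sketch with made-up stand-in data for munged_data:

```python
import pandas as pd

munged_data = pd.DataFrame({
    'dob':        pd.to_datetime(['1980-03-15', '1995-07-01']),
    'entry_date': pd.to_datetime(['2013-04-19', '2013-04-19']),
})

# timestamp subtraction is vectorized; .dt.days extracts integer days
days = (munged_data['entry_date'] - munged_data['dob']).dt.days
munged_data['age'] = (days / 365.25).round().astype(int)
print(munged_data)
```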

converting ddmmyy into mmyy format by using pandas?

I have a column (Month) in dd/mm/yy format. How can I convert it into mm/yy format?
Month
6/1/2017
5/1/2017
I have used the code below; can someone help?
import pandas as pd
df = pd.read_csv(r"C:\Users\venkagop\Subbu\UK_IYA.csv")
df['Month']=pd.to_datetime(df['Month'],format='%d/%m/%y')
df.to_csv(r"C:\Users\venkagop\Subbu\my test.csv")
I think you can convert the column to datetimes directly in read_csv with the parse_dates and dayfirst parameters, and then convert to the custom format with strftime:
df = pd.read_csv(r"C:\Users\venkagop\Subbu\UK_IYA.csv", parse_dates=['Month'], dayfirst=True)
df['Month']= df['Month'].dt.strftime('%b %y')
df.to_csv(r"C:\Users\venkagop\Subbu\my test.csv")
Your code:
df = pd.read_csv(r"C:\Users\venkagop\Subbu\UK_IYA.csv")
df['Month']=pd.to_datetime(df['Month'],format='%d/%m/%y').dt.strftime('%b %y')
df.to_csv(r"C:\Users\venkagop\Subbu\my test.csv")
Sample:
import pandas as pd
temp=u"""Month,sale
05/03/12,2
05/04/12,4
05/05/12,6
05/06/12,8"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp), parse_dates=['Month'], dayfirst=True)
print (df)
Month sale
0 2012-03-05 2
1 2012-04-05 4
2 2012-05-05 6
3 2012-06-05 8
df['Month']= df['Month'].dt.strftime('%b %y')
print (df)
Month sale
0 Mar 12 2
1 Apr 12 4
2 May 12 6
3 Jun 12 8
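
One caveat not raised above: strftime turns the column into plain strings, which no longer sort chronologically. If you want month-year values that still behave like dates, dt.to_period('M') is an alternative sketch:

```python
import pandas as pd

df = pd.DataFrame({'Month': pd.to_datetime(['05/03/12', '05/04/12'],
                                           format='%d/%m/%y')})

# Period values keep calendar semantics (sorting, arithmetic) but
# display as year-month only
df['Month'] = df['Month'].dt.to_period('M')
print(df)
```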

Finding max date of the month in a list of pandas timeseries dates

I have a timeseries without every date (i.e. only trading dates). The series can be reproduced here:
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16']))
I would like the last day of each month in my list of dates, i.e. '2010-01-29' and '2010-02-16'.
I have looked at Get the last date of each month in a list of dates in Python
and more specifically...
import pandas as pd
import numpy as np
df = pd.read_csv('/path/to/file/') # Load a dataframe with your file
df.index = df['my_date_field'] # set the dataframe index with your date
dfg = df.groupby(pd.Grouper(freq='M')) # group by month (use freq='MS' for month start); pd.TimeGrouper is the old, removed spelling
# Finally, find the max date in each month
dfg.agg({'my_date_field': np.max})
# To specifically coerce the results of the groupby to a list:
dfg.agg({'my_date_field': np.max})['my_date_field'].tolist()
... but I can't quite figure out how to adapt this to my application. Thanks in advance.
You can try the following to get your desired output:
import numpy as np
import pandas as pd
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16']))
This:
dates.groupby(dates.index.month).apply(pd.Series.tail,1).reset_index(level=0, drop=True)
Or this:
dates[dates.groupby(dates.index.month).apply(lambda s: np.max(s.index))]
Both should yield something like the following:
#2010-01-29 43
#2010-02-16 48
To convert it into a list:
dates.groupby(dates.index.month).apply(pd.Series.tail,1).reset_index(level=0, drop=True).tolist()
Or:
dates[dates.groupby(dates.index.month).apply(lambda s: np.max(s.index))].tolist()
Both yield something like:
#[43, 48]
If you're dealing with a dataset that spans beyond one year, then you will need to group by both year and month. The following should help:
import numpy as np
import pandas as pd
z = ['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16', '2011-01-04', '2011-01-05',
'2011-01-06', '2011-01-07', '2011-01-08', '2011-01-11',
'2011-01-12', '2011-01-13', '2011-01-14', '2011-01-15',
'2011-01-19', '2011-01-20', '2011-01-21', '2011-01-22',
'2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28',
'2011-01-29', '2011-02-01', '2011-02-02', '2011-02-03',
'2011-02-04', '2011-02-05', '2011-02-08', '2011-02-09',
'2011-02-10', '2011-02-11', '2011-02-12', '2011-02-16']
dates1 = pd.Series(np.random.randint(100,size=60),index=pd.to_datetime(z))
This:
dates1.groupby((dates1.index.year, dates1.index.month)).apply(pd.Series.tail,1).reset_index(level=(0,1), drop=True)
Or:
dates1[dates1.groupby((dates1.index.year, dates1.index.month)).apply(lambda s: np.max(s.index))]
Both yield something like:
# 2010-01-29 66
# 2010-02-16 80
# 2011-01-29 13
# 2011-02-16 10
I hope this proves useful.
You can group by month and take the last value of the index:
print (dates.groupby(dates.index.month).apply(lambda x: x.index[-1]))
1 2010-01-29
2 2010-02-16
dtype: datetime64[ns]
Another solution:
print (dates.groupby(dates.index.month).apply(lambda x: x.index.max()))
1 2010-01-29
2 2010-02-16
dtype: datetime64[ns]
For a list, first convert to strings with strftime:
print (dates.groupby(dates.index.month)
.apply(lambda x: x.index[-1]).dt.strftime('%Y-%m-%d').tolist())
['2010-01-29', '2010-02-16']
If you need the value on the last date of each month, use iloc:
print (dates.groupby(dates.index.month).apply(lambda x: x.iloc[-1]))
1 55
2 48
dtype: int64
print (dates.groupby(dates.index.month).apply(lambda x: x.iloc[-1]).tolist())
[55, 48]
EDIT:
For year and month, convert the index with to_period by month:
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(
['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2011-01-11', '2011-01-12', '2011-01-13',
'2012-01-14', '2012-01-15', '2012-01-19', '2012-01-20',
'2013-01-21', '2013-01-22', '2013-01-25', '2013-01-26',
'2013-01-27', '2013-01-28', '2013-01-29', '2013-02-01',
'2014-02-02', '2014-02-03', '2014-02-04', '2014-02-05',
'2015-02-08', '2015-02-09', '2015-02-10', '2015-02-11',
'2016-02-12', '2016-02-16']))
#print (dates)
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.index[-1]))
2010-01 2010-01-08
2011-01 2011-01-13
2012-01 2012-01-20
2013-01 2013-01-29
2013-02 2013-02-01
2014-02 2014-02-05
2015-02 2015-02-11
2016-02 2016-02-16
Freq: M, dtype: datetime64[ns]
print (dates.groupby(dates.index.to_period('m'))
.apply(lambda x: x.index[-1]).dt.strftime('%Y-%m-%d').tolist())
['2010-01-08', '2011-01-13', '2012-01-20', '2013-01-29',
'2013-02-01', '2014-02-05', '2015-02-11', '2016-02-16']
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.iloc[-1]))
2010-01 68
2011-01 96
2012-01 53
2013-01 4
2013-02 16
2014-02 18
2015-02 41
2016-02 90
Freq: M, dtype: int64
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.iloc[-1]).tolist())
[68, 96, 53, 4, 16, 18, 41, 90]
EDIT1: If you need to convert the period to an end-of-month datetime:
df = dates.groupby(dates.index.to_period('m')).apply(lambda x: x.index[-1])
df.index = df.index.to_timestamp('m')
print (df)
2010-01-31 2010-01-08
2011-01-31 2011-01-13
2012-01-31 2012-01-20
2013-01-31 2013-01-29
2013-02-28 2013-02-01
2014-02-28 2014-02-05
2015-02-28 2015-02-11
2016-02-29 2016-02-16
dtype: datetime64[ns]