I have a pandas DataFrame like the following:
Year Month Day Securtiy Trade Value NewDate
2011 1 10 AAPL Buy 1500 0
My question is: how can I merge the columns Year, Month, and Day into the column NewDate,
so that the NewDate column looks like the following:
2011-1-10
The best way is to parse the dates while reading the CSV:
In [1]: df = pd.read_csv('foo.csv', sep='\s+', parse_dates=[['Year', 'Month', 'Day']])
In [2]: df
Out[2]:
Year_Month_Day Securtiy Trade Value NewDate
0 2011-01-10 00:00:00 AAPL Buy 1500 0
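If you want the parsed dates under your NewDate name, a quick sketch (just copying the combined column that read_csv produced):
df['NewDate'] = df['Year_Month_Day']  # the parsed dates, under the original column name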
If the file has no header row, you can supply the column names while reading:
pd.read_csv(input_file, header=None, names=['Year', 'Month', 'Day', 'Security', 'Trade', 'Value'], parse_dates=[['Year', 'Month', 'Day']])
If it's already in your DataFrame, you could use an apply:
In [11]: df['Date'] = df.apply(lambda s: pd.Timestamp('%s-%s-%s' % (s['Year'], s['Month'], s['Day'])), axis=1)
In [12]: df
Out[12]:
Year Month Day Securtiy Trade Value NewDate Date
0 2011 1 10 AAPL Buy 1500 0 2011-01-10 00:00:00
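For reference, a vectorized alternative once the frame is loaded (a minimal sketch, assuming the columns are named Year, Month and Day as above and a reasonably recent pandas) is to let pd.to_datetime assemble the date from the three columns:
# rename to the lowercase names the to_datetime docs use, then assemble a datetime column
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']].rename(columns=str.lower))
This avoids the per-row apply.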
If the columns are strings (or cast with astype(str)), plain concatenation also works:
df['NewDate'] = df['Year'].astype(str) + '-' + df['Month'].astype(str) + '-' + df['Day'].astype(str)
You can create a new Timestamp as follows:
df['newDate'] = df.apply(lambda x: pd.Timestamp('{0}-{1}-{2}'
                                                .format(x.Year, x.Month, x.Day)),
                         axis=1)
>>> df
Year Month Day Securtiy Trade Value NewDate newDate
0 2011 1 10 AAPL Buy 1500 0 2011-01-10
import pandas as pd
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['Company',df['Date'].dt.year]).diff()
print(df)
Date Revenue YTD
0 NaT NaN
1 92 days -100.0
2 NaT NaN
3 92 days -22670.0
4 NaT NaN
5 92 days 28627.0
I would like to calculate each company's revenue difference between September and December. I have tried grouping by company and year, but the result is not what I am expecting.
Expected result:
Date Company Revenue YTD
0 2017 A -100
1 2018 A -22670
2 2017 B 28627
IIUC, this should work
(df.assign(Date=df['Date'].dt.year,
Revenue_Diff=df.groupby(['Company',df['Date'].dt.year])['Revenue YTD'].diff())
.drop('Revenue YTD', axis=1)
.dropna()
)
Output:
Date Company Revenue_Diff
1 2017 A -100.0
3 2017 B -22670.0
5 2018 A 28627.0
Try this:
Set it up:
import pandas as pd
import numpy as np
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
Then aggregate with np.diff():
my_func = lambda x: np.diff(x)
df = (df.groupby([df.Date.dt.year, df.Company])
.agg({'Revenue YTD':my_func}))
print(df)
Revenue YTD
Date Company
2017 A -100
B -22670
2018 A 28627
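If you want Date and Company back as ordinary columns rather than a MultiIndex, a small sketch (assuming df is the grouped result above):
df = df.reset_index()  # Date and Company become regular columns again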
Hope this helps.
I want to categorize data by a month column,
e.g.
date           Month
2009-05-01 ==> May
I want to check the outcomes by month.
In this table I am excluding years and only want to keep months.
This is simple using pd.Series.dt.month_name (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.month_name.html):
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2000-01-01', '2010-01-01', freq='1M')
})
df['month'] = df.date.dt.month_name()
df.head()
Output
date month
0 2000-01-31 January
1 2000-02-29 February
2 2000-03-31 March
3 2000-04-30 April
4 2000-05-31 May
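To then check outcomes by month across years, a minimal sketch (the 'value' column is hypothetical, added here only for illustration):
df['value'] = range(len(df))          # hypothetical outcome column
df.groupby('month')['value'].sum()    # aggregates the same calendar month across all years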
I have a DataFrame in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate people's ages based on the time difference between 'entry_date' and 'dob', and to do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
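To finish the age calculation from the question, a minimal sketch using the asker's round(days/365.25) idea on the same .dt.days values:
# approximate ages in whole years (assumes entry_date and dob are datetime64 columns)
ages = ((munged_data['entry_date'] - munged_data['dob']).dt.days / 365.25).round()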
You need pandas 0.11 for this (0.11rc1 is out; the final release is probably out next week).
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; this is coming in 0.12).
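On pandas 0.15 and later the odd apply is no longer needed; a sketch of the same 'years' column using the .dt accessor on the diff column built above:
df['years'] = df['diff'].dt.days / 365  # timedelta Series -> integer days -> fractional years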
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's say you have a pandas Series named time_difference with dtype timedelta64[ns].
One way of extracting just the days (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert other kinds of values into days, just use pd.Timedelta(...).days, e.g.:
pd.Timedelta(1985, unit='h').days
82
I have a timedelta column in pandas, in the format 'x days 00:00:00'. I want to filter out and flag the rows which have a value >= 30 minutes. I have no clue how to do that using pandas; I tried booleans and if statements but it didn't work. Any help would be appreciated.
You can convert the timedeltas to seconds with total_seconds and compare against a scalar:
df = df[df['col'].dt.total_seconds() < 30]
Or compare with Timedelta:
df = df[df['col'] < pd.Timedelta(30, unit='s')]
Sample:
df = pd.DataFrame({'col':pd.to_timedelta(['25:10:01','00:01:20','00:00:20'])})
print (df)
col
0 1 days 01:10:01
1 0 days 00:01:20
2 0 days 00:00:20
df = df[df['col'].dt.total_seconds() < 30]
print (df)
col
2 00:00:20
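To flag (rather than drop) the rows of at least 30 minutes, as the question asks, a minimal sketch adding a hypothetical boolean column named flag:
df['flag'] = df['col'] >= pd.Timedelta(minutes=30)  # True for rows of 30 minutes or more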
What's the most pandas-appropriate way of achieving this? I want to create a column with datetime objects from the 'year', 'month' and 'day' columns, but all I came up with is some code that looks way too cumbersome:
import datetime as dt

myList = []
for row in df_orders.iterrows():  # df_orders is the dataframe
    myList.append(dt.datetime(row[1][0], row[1][1], row[1][2]))
    # --> year, month and day are the 0th, 1st and 2nd columns
mySeries = pd.Series(myList, index=df_orders.index)
df_orders['myDateFormat'] = mySeries
Thanks a lot for any help.
Try this:
In [1]: df = pd.DataFrame(dict(yyyy=[2000, 2000, 2000, 2000],
mm=[1, 2, 3, 4], day=[1, 1, 1, 1]))
Convert to an integer:
In [2]: df['date'] = df['yyyy'] * 10000 + df['mm'] * 100 + df['day']
Convert to a string, then a datetime (as pd.to_datetime will interpret the integer differently):
In [3]: df['date'] = pd.to_datetime(df['date'].apply(str))
In [4]: df
Out[4]:
day mm yyyy date
0 1 1 2000 2000-01-01 00:00:00
1 1 2 2000 2000-02-01 00:00:00
2 1 3 2000 2000-03-01 00:00:00
3 1 4 2000 2000-04-01 00:00:00
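A small variant of the last two steps (a sketch, using the same frame): cast the YYYYMMDD integers to strings and parse with an explicit format, so the interpretation is unambiguous:
# build the integer, convert to str, then parse with a fixed format
df['date'] = pd.to_datetime((df['yyyy'] * 10000 + df['mm'] * 100 + df['day']).astype(str), format='%Y%m%d')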