Pandas Series: Decrement DateTime by 100 Years - pandas

I have a pandas series as follows...
0 2039-03-16
1 2056-01-21
2 2051-11-18
3 2064-03-05
4 2048-06-05
Name: BIRTH, dtype: datetime64
It was created from string data as follows
s = data['BIRTH']
s = pd.to_datetime(s)
s
I want to shift all dates after the year 2040 back by 100 years (e.g. 2048 becomes 1948).
I can do this for a single record as follows:
d = s.iloc[0]
d.replace(year=d.year - 100)
but I really want to run it over the whole series, and I can't work it out. Help!?
PS - I know there's ways outside of pandas using Python's DT module but I'd like to learn how to do this within Pandas please

Using DateOffset is the obvious choice here:
df['date'] - pd.offsets.DateOffset(years=100)
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Assign it back:
df['date'] -= pd.offsets.DateOffset(years=100)
df
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
The offsets module exists to deal with non-fixed frequencies, and it comes in handy in situations like these.
To fix your original approach, you could apply datetime.replace row-wise using apply (not recommended):
df['date'].apply(lambda x: x.replace(year=x.year-100))
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Or using a list comprehension,
df.assign(date=[x.replace(year=x.year-100) for x in df['date']])
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Neither of these handles NaT entries very well.
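Since the question only wants dates past a cutoff shifted, here is a minimal sketch combining DateOffset with Series.where, assuming the cutoff year is 2040 and using the question's series s:
cutoff = 2040  # assumed cutoff year from the question
shifted = s - pd.offsets.DateOffset(years=100)
# keep dates before the cutoff as-is, otherwise take the shifted value;
# NaT entries remain NaT in either branch
s = s.where(s.dt.year < cutoff, shifted)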

Related

Converting pandas valuecounts to a specific type

I am trying to display the percentages of a particular dataframe column as a percentage of its grand total. I have the constraint that the result must be a specific data type (numpy float does not fly).
My code is quite simple
dict(df['marital_status'].value_counts().transform(lambda x: x/sum(x)))
I tried astype() and casting the values within the transform function itself, but no joy.
Instead of your function, use normalize=True in Series.value_counts, then multiply by 100 for percentages, and if you need integers, cast after rounding:
print (df)
marital_status
0 1
1 0
2 1
3 1
4 1
5 0
6 0
d = df['marital_status'].value_counts(normalize=True).mul(100).round().astype(int).to_dict()
print (d)
{1: 57, 0: 43}
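If the constraint is that the dictionary must hold plain Python numbers rather than numpy scalars (my reading of "numpy float does not fly"), a sketch that converts explicitly while building the dict:
pct = df['marital_status'].value_counts(normalize=True).mul(100).round(2)
d = {k: float(v) for k, v in pct.items()}  # values are plain Python floats
print(d)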

Why is datetime64 converted to timedelta64 when converting into a YYYY-MM string [duplicate]

This question already has answers here:
datetime to string with series in pandas
(3 answers)
I want to convert time columns (dtype: datetime64[ns]) in a pandas.DataFrame into strings representing the year and month only.
It works as expected if all values in the column are valid.
0 2019-4
1 2017-12
dtype: object
But with missing values (pandas.NaT) in the column the result confuses me.
0 -1 days +23:59:59.999979806
1 -1 days +23:59:59.999798288
2 NaT
dtype: timedelta64[ns]
Or with .unique() it is array([ -20194, -201712, 'NaT'], dtype='timedelta64[ns]').
It seems that the result somehow becomes a timedelta64, but I don't understand why. Why does this happen?
The complete example code:
#!/usr/bin/env python3
import pandas as pd
import numpy as np

# series without missing values
series = pd.Series([
    np.datetime64('2019-04-08'),
    np.datetime64('2017-12-05')])

def year_month_string(cell):
    """Convert a datetime64 into a string representation with
    year and month only.
    """
    if pd.isna(cell):
        return pd.NaT
    return '{}-{}'.format(cell.year, cell.month)

print(series.apply(year_month_string))
# 0 2019-4
# 1 2017-12
# dtype: object
# Series with a missing value
series_nat = pd.Series([
    np.datetime64('2019-04-08'),
    np.datetime64('2017-12-05'),
    pd.NaT])
result = series_nat.apply(year_month_string)
print(result)
# 0 -1 days +23:59:59.999979806
# 1 -1 days +23:59:59.999798288
# 2 NaT
# dtype: timedelta64[ns]
print(result.unique())
# array([ -20194, -201712, 'NaT'], dtype='timedelta64[ns]')
Don't use a custom function, use strftime with %-m (the minus strips the leading zeros):
series_nat.dt.strftime('%Y-%-m')
output:
0 2019-4
1 2017-12
2 NaN
dtype: object
%m would keep the leading zeros:
series_nat.dt.strftime('%Y-%m')
output:
0 2019-04
1 2017-12
2 NaN
dtype: object
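Note that %-m is a platform-dependent strftime extension (it is not supported on Windows, for example). A portable sketch that strips the zero afterwards instead:
formatted = series_nat.dt.strftime('%Y-%m')
# drop a leading zero from the month part; NaT entries stay NaN
formatted = formatted.str.replace(r'-0(\d)$', r'-\1', regex=True)
print(formatted)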

Groupby two columns, one of which is datetime

I have a data frame that I want to group by two columns, one of which is of datetime type. How can I do this?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with a pd.DatetimeIndex to group data at a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the date column as the index; it will be converted to a pd.DatetimeIndex. Then you can use pd.Grouper together with other columns. The following example uses the category column.
The freq='M' parameter groups the index by month. There are a number of offset alias strings that can be used with pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example with your MCVE:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
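If you prefer not to move the date column into the index, pd.Grouper also accepts a key argument naming the column to group on. A minimal sketch with the first example frame (the .T construction leaves object dtypes, so they are converted first):
# ensure proper dtypes, then group by month on the 'date' column directly
df['date'] = pd.to_datetime(df['date'])
df['value'] = pd.to_numeric(df['value'])
out = df.groupby([pd.Grouper(key='date', freq='M'), 'category'])['value'].sum()
print(out)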

Why can't I change the series format?

I have the following series I obtained from a read_html:
series:
1        417.951
2        621.710
3        164.042
4        189.963
5        555.123
6        213.494
7      2.873.093
I would like to remove the . in order to apply some function to the numbers in that column.
So the desired output would be:
series:
1        417951
2        621710
3        164042
4        189963
5        555123
6        213494
7       2873093
I have tried replace, receiving the same result:
df.replace('.','')
and turning the series into a dataframe to see whether that was the problem, but it keeps returning the initial series.
You need to assign the output back to the Series and, if necessary, convert to int. It is also necessary to escape the . with a backslash and pass regex=True to Series.replace:
series = series.replace(r'\.', '', regex=True)
print (series)
1 417951
2 621710
3 164042
4 189963
5 555123
6 213494
7 2873093
Name: a, dtype: object
series = series.replace(r'\.', '', regex=True).astype(int)
print (series)
1 417951
2 621710
3 164042
4 189963
5 555123
6 213494
7 2873093
Name: a, dtype: int32
Another solution is to use str.replace; passing regex=False makes it explicit that the dot is a literal character:
series = series.str.replace('.', '', regex=False)
print (series)
1 417951
2 621710
3 164042
4 189963
5 555123
6 213494
7 2873093
Name: a, dtype: object
But it is better to use the thousands parameter in read_html:
df = pd.read_html(url, thousands='.')
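If the source table also uses a comma as the decimal separator (an assumption about the data), read_html can handle that at parse time as well; note that it returns a list of DataFrames:
tables = pd.read_html(url, thousands='.', decimal=',')
df = tables[0]  # take the first parsed table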

detecting jumps on pandas index dates

I managed to load historical data on data series on a large set of financial instruments, indexed by date.
I am plotting volume , price information without any issue.
What I want to achieve now is to determine if there is any big jump in dates, to see if I am missing large chunks of data.
The idea I had in mind was to somehow plot the difference between two consecutive dates in the index; if the number is greater than 3 or 4 days (which is longer than a weekend plus a bank holiday on a Friday or Monday), then there is an issue.
The problem is I can't figure out how to simply compute df[next day] - df[day], where df is indexed by day.
You can use the shift Series method (note the DatetimeIndex method shifts by freq):
In [11]: rng = pd.DatetimeIndex(['20120101', '20120102', '20120106']) # DatetimeIndex like df.index
In [12]: s = pd.Series(rng) # df.index instead of rng
In [13]: s - s.shift()
Out[13]:
0 NaT
1 1 days, 00:00:00
2 4 days, 00:00:00
dtype: timedelta64[ns]
In [14]: s - s.shift() > pd.offsets.Day(3).nanos
Out[14]:
0 False
1 False
2 True
dtype: bool
Depending on what you want, perhaps you could either do any, or find the problematic values...
In [15]: (s - s.shift() > pd.offsets.Day(3).nanos).any()
Out[15]: True
In [16]: s[s - s.shift() > pd.offsets.Day(3).nanos]
Out[16]:
2 2012-01-06 00:00:00
dtype: datetime64[ns]
Or perhaps find the maximum jump (and where it is):
In [17]: (s - s.shift()).max() # it's weird this returns a Series...
Out[17]:
0 4 days, 00:00:00
dtype: timedelta64[ns]
In [18]: (s - s.shift()).idxmax()
Out[18]: 2
If you really wanted to plot this, simply plotting the difference would work:
(s - s.shift()).plot()
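In recent pandas versions, comparing a timedelta Series to the integer from .nanos raises a TypeError, so here is a sketch of the same idea using diff and pd.Timedelta (assuming df is indexed by a DatetimeIndex):
gaps = df.index.to_series().diff()       # timedelta between consecutive dates
suspect = gaps > pd.Timedelta(days=3)    # True where the jump exceeds 3 days
print(df.index[suspect])                 # dates that follow a large gap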