Converting dataframe object to date using to_datetime - pandas

I have a data set that looks like this:
date id
0 2014-01-01 11000929
1 2014-01-01 11000190
2 2014-01-01 11000216
3 2014-01-01 11000822
4 2014-01-01 11000971
5 2014-01-01 11000721
6 2014-01-01 11000970
7 2014-01-01 11000574
8 2014-01-01 11000967
9 2014-01-01 11000172
10 2014-01-01 11000208
11 2014-01-01 11000966
12 2014-01-01 11000344
13 2014-01-01 11000965
14 2014-01-01 11000935
15 2014-01-01 11000964
16 2014-01-01 11000741
17 2014-01-01 11000868
18 2014-01-01 11000035
19 2014-01-01 11000203
20 2014-01-02 11000574
As you can see, there are a lot of duplicate dates for different products. I will merge this table with another table, which requires me to convert the date column, currently an object, to datetime64[ns].
I tried
df_date_id.date = pd.to_datetime(df_date_id.date)
but I end up having the error:
TypeError: <class 'pandas._libs.tslibs.period.Period'> is not convertible to datetime
p.s: the table I am going to merge with looks like this:
date id score
0 2014-01-01 11000035 75
1 2014-01-02 11000035 84
2 2014-01-03 11000035 55
so the date format of both tables looks the same to me.
Thanks in advance.

I think it is necessary to convert the periods to datetimes with to_timestamp:
df['date'] = df['date'].dt.to_timestamp()
print (df['date'].dtypes)
datetime64[ns]
Another solution is to convert the column in the other DataFrame to periods instead:
df2['date'] = df2['date'].dt.to_period('d')
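A minimal, self-contained sketch of the first fix, assuming the date column really holds pandas Periods (which is what the TypeError suggests); the column names mirror the question:
import pandas as pd

# assumed reproduction: 'date' stored as period[D], as the TypeError implies
df = pd.DataFrame({
    'date': pd.period_range('2014-01-01', periods=3, freq='D'),
    'id': [11000929, 11000190, 11000216],
})
print(df['date'].dtype)    # period[D]

df['date'] = df['date'].dt.to_timestamp()    # Period -> Timestamp
print(df['date'].dtype)    # datetime64[ns]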

Works for me by specifying the format (note that %m is the month directive; %M parses minutes):
df.date = pd.to_datetime(df.date, format='%Y-%m-%d')
date id
0 2014-01-01 11000929
1 2014-01-01 11000190
2 2014-01-01 11000216
3 2014-01-01 11000822
4 2014-01-01 11000971
If that fails because the values are not strings, try:
df.date = pd.to_datetime(df.date.astype(str), format='%Y-%m-%d')
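A common pitfall worth a quick check: '%M' parses minutes, not months, so the month digits silently leak into the time component. A minimal demonstration:
import pandas as pd

# '%M' is minutes: the '01' is read as minute 1, and the month defaults to January
print(pd.to_datetime('2014-01-01', format='%Y-%M-%d'))
# 2014-01-01 00:01:00

# '%m' is the zero-padded month, which is what these dates actually use
print(pd.to_datetime('2014-01-01', format='%Y-%m-%d'))
# 2014-01-01 00:00:00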

Pandas groupby issue after melt bug?

Python version 3.8.12
pandas 1.4.1
Given the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': [1000] * 4,
    'date': ['2022-01-01'] * 4,
    'ts': pd.date_range('2022-01-01', freq='5min', periods=4),  # 5-minute steps, not '5M' (month end)
    'A': np.random.randint(1, 6, size=4),
    'B': np.random.rand(4)
})
that looks like this:
     id        date                  ts  A          B
0  1000  2022-01-01 2022-01-01 00:00:00  4    0.98019
1  1000  2022-01-01 2022-01-01 00:05:00  3    0.82021
2  1000  2022-01-01 2022-01-01 00:10:00  4   0.549684
3  1000  2022-01-01 2022-01-01 00:15:00  5  0.0818311
I transposed the columns A and B with pandas melt:
melted = df.melt(
    id_vars=['id', 'date', 'ts'],
    value_vars=['A', 'B'],
    var_name='label',
    value_name='value',
    ignore_index=True
)
that looks like this:
     id        date                  ts label      value
0  1000  2022-01-01 2022-01-01 00:00:00     A          4
1  1000  2022-01-01 2022-01-01 00:05:00     A          3
2  1000  2022-01-01 2022-01-01 00:10:00     A          4
3  1000  2022-01-01 2022-01-01 00:15:00     A          5
4  1000  2022-01-01 2022-01-01 00:00:00     B    0.98019
5  1000  2022-01-01 2022-01-01 00:05:00     B    0.82021
6  1000  2022-01-01 2022-01-01 00:10:00     B   0.549684
7  1000  2022-01-01 2022-01-01 00:15:00     B  0.0818311
Then I groupby and select the first group:
melted.groupby(['id', 'date']).first()
that gives me this:
ts label value
id date
1000 2022-01-01 2022-01-01 A 4.0
but I would expect this output instead:
ts A B
id date
1000 2022-01-01 2022-01-01 00:00:00 4 0.980190
2022-01-01 2022-01-01 00:05:00 3 0.820210
2022-01-01 2022-01-01 00:10:00 4 0.549684
2022-01-01 2022-01-01 00:15:00 5 0.081831
What am I not getting? Or is this a bug? Also, why is the ts column converted to a date?
My bad!!! I thought first would get the first group, but instead it gets the first element of each group, as stated in the documentation for pandas' aggregation functions. Sorry folks, was doing this late at night and could not think straight :/
To select the first group, I needed to use the get_group function.
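A minimal sketch of that fix, reusing the melted frame from above (the key tuple is taken from the sample data):
# get_group returns the rows of one group; .first() instead aggregates the
# first element of every column within each group
first_key = (1000, '2022-01-01')    # (id, date) key of the only group here
print(melted.groupby(['id', 'date']).get_group(first_key))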

Error in creating a new columns in pandas dataframe

Tried creating a new column to categorize different time frames into categories using np.select. However, Python throws a shape-mismatch error, and I'm not sure how to correct it.
For your logic, it's simpler to use the hour attribute of the datetime:
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("1-Apr-2021", "now", freq="4H"), name="start_date")
(s.to_frame()
  .join(pd.Series(np.select([s.dt.hour.between(1, 6),
                             s.dt.hour.between(7, 12)],
                            [1, 2], 0), name="cat"))
  .head(8)
)
            start_date  cat
0  2021-04-01 00:00:00    0
1  2021-04-01 04:00:00    1
2  2021-04-01 08:00:00    2
3  2021-04-01 12:00:00    2
4  2021-04-01 16:00:00    0
5  2021-04-01 20:00:00    0
6  2021-04-02 00:00:00    0
7  2021-04-02 04:00:00    1
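Note that np.select evaluates the conditions in order and falls back to the default (0 here) whenever neither condition matches, which is why the midnight, afternoon and evening rows land in category 0.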

Pandas take daily mean within resampled date

I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want to take the daily mean of all trip counts for every 20-minute slot. Desired output (for the above values) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
72 23:40:00 ..
You can aggregate by the times created by Series.dt.time, because the minutes are always 00, 20 or 40 and there are no seconds:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby(df['Date'].dt.time).mean()
#alternative
#df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S')).mean()
print (df1)
Trip count
Date
00:00:00 4.5
00:20:00 3.0
00:40:00 3.0
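A self-contained sketch, rebuilding the sample frame from the question so the groupby can be run end to end:
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-08-01 00:00:00', '2019-08-01 00:20:00', '2019-08-01 00:40:00',
             '2019-08-02 00:00:00', '2019-08-02 00:20:00', '2019-08-02 00:40:00'],
    'Trip count': [3, 2, 4, 6, 4, 2],
})
df['Date'] = pd.to_datetime(df['Date'])

# one row per 20-minute slot, averaged across all days in the data
print(df.groupby(df['Date'].dt.time)['Trip count'].mean())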

Selecting data from one dataframe base on column in second dataframe

I have a dataframe (df), contains datetime columns startdate, enddate and volume of product
If I want to look at one particular date that fits between startdate and enddate and its total volume, I can do it with no problem at all (see code).
However, if I create a second dataframe (call it report) with a list of dates for which I would like to look up the total volume of product from the first df, I get an error:
Can only compare identically-labeled Series objects
I read up on things like dropping the index on the second df or sorting the dates, but they don't seem to work.
So my working code for requesting volume fitted within startdate and enddate, say from the first of July 2019:
df[(df['StartDate'] >= '2019-07-01') & (df['EndDate'] <= '2019-10-31')]['Volume'].sum()
but if i create a second df (report):
report = pd.Series(pd.date_range('today', periods=len(df), freq='D').normalize(),name='Date')
report = pd.DataFrame(report)
and request what I want to see:
report['trial'] = df[(df['StartDate'] >= report.Date) & (df['EndDate'] <= report.Date)]['Volume'].sum()
I get this error: 'Can only compare identically-labeled Series objects'
Any advice/suggestions welcome, thanks!
First, some sample data:
import numpy as np
import pandas as pd

np.random.seed(42)
dates = pd.date_range('2019-01-01', '2019-12-01', freq='MS')
df = pd.DataFrame({
    'StartDate': dates,
    'EndDate': dates + pd.offsets.MonthEnd(),
    'Volume': np.random.randint(1, 10, len(dates))
})
StartDate EndDate Volume
0 2019-01-01 2019-01-31 7
1 2019-02-01 2019-02-28 4
2 2019-03-01 2019-03-31 8
3 2019-04-01 2019-04-30 5
4 2019-05-01 2019-05-31 7
5 2019-06-01 2019-06-30 3
6 2019-07-01 2019-07-31 7
7 2019-08-01 2019-08-31 8
8 2019-09-01 2019-09-30 5
9 2019-10-01 2019-10-31 4
10 2019-11-01 2019-11-30 8
11 2019-12-01 2019-12-31 8
And the report dates:
reports = pd.to_datetime(['2019-01-15', '2019-02-15', '2019-08-15'])
Using numpy's array broadcasting:
start = df['StartDate'].values
end = df['EndDate'].values
d = reports.values[:, None]
df[np.any((start <= d) & (d <= end), axis=0)]
Result:
StartDate EndDate Volume
0 2019-01-01 2019-01-31 7
1 2019-02-01 2019-02-28 4
7 2019-08-01 2019-08-31 8
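To also get the total volume per report date, the same broadcast mask can be reduced row-wise; a hedged sketch:
mask = (start <= d) & (d <= end)                     # shape: (len(reports), len(df))
totals = (mask * df['Volume'].values).sum(axis=1)    # sum volumes of matching rows
print(pd.DataFrame({'Date': reports, 'Volume': totals}))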

Handle Perpetual Maturity Bonds with Maturity date of 31-12-9999 12:00:00 AM

I have a number of records in a dataframe where the maturity date column is 31-12-9999 12:00:00 AM, as the bonds never mature. This naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify the best approach to clean all date columns in the dataframe and fix my bug. My code, modelled off the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 9999-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
    return pd.Period(day=x % 100, month=x // 100 % 100, year=x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
The problem is that you cannot convert out-of-bounds datetimes.
One solution is to replace 9999 with 2261 (this assumes df_Fix_Date is the one-column DataFrame df_Date[['maturity_date']].head(8); selecting with single brackets returns a Series, which is what raises the KeyError):
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is to replace all dates with a year greater than 2261 with 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or replace the problematic dates with NaT via the parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT
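If the perpetual bonds should still sort after every real maturity, a hedged follow-up to the coerce approach is to pin the NaT rows to the largest representable timestamp:
# NaT rows (the former 9999 dates) become pd.Timestamp.max, i.e. 2262-04-11
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].fillna(pd.Timestamp.max)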