Pandas Compare - How to compare 2 date columns in 2 separate dataframes

I have one csv with missing dates, and I have created a new df covering that same date range without the missing dates. I want to compare the two and place a NaN wherever a date is missing from the original csv:
Example:
DateTime Measurement Dates
0 2016-10-09 00:00:00 1021.9 2016-10-09
1 2016-10-11 00:00:00 1019.9 2016-10-10
2 2016-10-12 00:00:00 1015.8 2016-10-11
3 2016-10-13 00:00:00 1013.2 2016-10-12
4 2016-10-14 00:00:00 1005.9 2016-10-13
so I want the new table to be:
DateTime Measurement Dates
0 2016-10-09 00:00:00 1021.9 2016-10-09
1 NaN NaN 2016-10-10
2 2016-10-11 00:00:00 1019.9 2016-10-11
3 2016-10-12 00:00:00 1015.8 2016-10-12
4 2016-10-13 00:00:00 1013.2 2016-10-13
And then I will remove the DateTime column so the final df is a complete list of dates with the missing measurements.
The code I have used thus far:
new_dates = pandas.date_range(start = '2016-10-09 00:00:00', end = '2017-10-09 00:00:00')
merged = pandas.merge(measurements, updated_dates,left_index=True, right_index=True)

If I understand you correctly, you want to resample your DateTime column to a daily frequency and fill the gaps with NaN:
# Use this line if your DateTime column is not datetime type yet
# df['DateTime'] = pd.to_datetime(df['DateTime'])
dates = pd.date_range(df['DateTime'].min(), df['DateTime'].max(), freq='D')
df = df.set_index('DateTime').reindex(dates).reset_index()
Output
index Measurement
0 2016-10-09 1021.9
1 2016-10-10 NaN
2 2016-10-11 1019.9
3 2016-10-12 1015.8
4 2016-10-13 1013.2
5 2016-10-14 1005.9
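Note that reset_index restores the dates under the column name index; if you want it called Dates as in the question, a one-line follow-up (continuing from the code above):
df = df.rename(columns={'index': 'Dates'})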
If your dates are unique, you can use resample as well. If your dates are not unique, resample will aggregate them and take the mean of the duplicate dates:
df.set_index('DateTime').resample('D').mean()
Output
DateTime Measurement
2016-10-09 1021.9
2016-10-10 NaN
2016-10-11 1019.9
2016-10-12 1015.8
2016-10-13 1013.2
2016-10-14 1005.9
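To illustrate the aggregation behaviour, a minimal sketch with a deliberately duplicated date (the numbers are made up):
import pandas as pd

df = pd.DataFrame({'DateTime': pd.to_datetime(['2016-10-09', '2016-10-09', '2016-10-11']),
                   'Measurement': [1000.0, 1020.0, 1015.8]})
# resample('D').mean() averages the two 2016-10-09 rows (giving 1010.0)
# and inserts NaN for the missing 2016-10-10
print(df.set_index('DateTime').resample('D').mean())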

Related

Pandas DF Convert All Dates to YYYY-MM-DD format

I have data that looks like this stored in a DF, and I'm trying to convert the "DATE" column so that all the dates are in YYYY-MM-DD format instead of YYYY-DD-MM, as you can see when the date changes to a new day by the "TIME" column. (Some of the dates not shown are already in YYYY-MM-DD format, but I'm trying to change all of them to YYYY-MM-DD.)
DATE TIME BAFFIN BAY GATUN II GATUN I KLONDIKE IIIG \
8778 2016-01-01 1900 8.926278 8.046583 7.649784 7.333993
8779 2016-01-01 2000 8.817666 4.395097 4.748931 6.672631
8780 2016-01-01 2100 8.704014 6.384826 7.128692 6.115349
8781 2016-01-01 2200 8.496358 8.261933 8.166153 6.242737
8782 2016-01-01 2300 8.434297 4.656991 5.894877 5.781445
8783 2016-02-01 0000 8.528372 3.056838 3.086056 5.023564
8784 2016-02-01 0100 8.783731 4.614589 4.894076 5.042875
8785 2016-02-01 0200 8.572500 3.860174 4.641366 5.174426
8786 2016-02-01 0300 8.279557 2.076971 2.644479 5.492729
8787 2016-02-01 0400 8.378920 3.562210 2.806703 5.356025
I'm trying to set the "DATE" column to a datetime column, specifying the format, but it does nothing:
df2['DATE'] = pd.to_datetime(df2['DATE'],format='%Y-%m-%d')
thank you in advance for your help!
Can you try this:
pd.to_datetime(df['DATE'], dayfirst=True)
0 2016-01-01
1 2016-01-01
2 2016-01-01
3 2016-01-01
4 2016-01-01
5 2016-01-02
6 2016-01-02
7 2016-01-02
8 2016-01-02
9 2016-01-02
Consider joining 'DATE' and 'TIME' to get a complete datetime column. Assuming both columns are of dtype object (string), you can combine them using the + operator and then call pd.to_datetime with a specified format. Ex:
import pandas as pd
df = pd.DataFrame({'DATE': ['2016-01-01', '2016-02-01'],
                   'TIME': ['1900', '0000']})
df['DateTime'] = pd.to_datetime(df['DATE']+df['TIME'], format='%Y-%d-%m%H%M')
# df['DateTime']
# 0 2016-01-01 19:00:00
# 1 2016-01-02 00:00:00
# Name: DateTime, dtype: datetime64[ns]
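Once the combined DateTime column exists, deriving the normalized YYYY-MM-DD strings the question asks for is straightforward; a short follow-up sketch, continuing from the frame above:
# derive normalized string columns from the parsed DateTime
df['DATE'] = df['DateTime'].dt.strftime('%Y-%m-%d')
df['TIME'] = df['DateTime'].dt.strftime('%H%M')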

pandas reindex fill in missing dates

I have a dataframe with an index of dates. Each date is the first of the month. I want to fill in all missing dates in the index at a daily level.
I thought this should work:
daily=pd.date_range('2016-01-01', '2018-01-01', freq='D')
df=df.reindex(daily)
But it's returning NA in rows that should have data (the first-of-the-month dates). Can anyone see the issue?
Use reindex with the parameter method='ffill', or resample with ffill for a more general solution, because then it is not necessary to create a new index with date_range:
df = pd.DataFrame({'a': range(13)},
                  index=pd.date_range('2016-01-01', '2017-01-01', freq='MS'))
print (df)
a
2016-01-01 0
2016-02-01 1
2016-03-01 2
2016-04-01 3
2016-05-01 4
2016-06-01 5
2016-07-01 6
2016-08-01 7
2016-09-01 8
2016-10-01 9
2016-11-01 10
2016-12-01 11
2017-01-01 12
daily=pd.date_range('2016-01-01', '2018-01-01', freq='D')
df1 = df.reindex(daily, method='ffill')
Another solution:
df1 = df.resample('D').ffill()
print (df1.head())
a
2016-01-01 0
2016-01-02 0
2016-01-03 0
2016-01-04 0
2016-01-05 0
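As for why the plain reindex in the question returned NA even on the first-of-month rows: one likely cause (an assumption, since the question does not show how the frame was built) is that the existing index holds strings rather than Timestamps, so none of its labels match the DatetimeIndex created by date_range. Converting the index first fixes that:
# assumed scenario: the index was read in as strings, e.g. from a csv
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range('2016-01-01', '2018-01-01', freq='D'))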

Handle Perpetual Maturity Bonds with Maturity date of 31-12-9999 12:00:00 AM

I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify the best approach to clean all date columns in the dataframe and fix my bug. My code is modelled off the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 2022-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
    return pd.Period(day = x%100, month = x//100 % 100, year = x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
The problem is that you cannot convert out-of-bounds datetimes.
One solution is to replace the year 9999 with 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is to replace all dates with a year higher than 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or replace the problematic dates with NaT using the parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT
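If the NaT route is taken, it can be worth flagging the perpetual bonds first so that information is not silently lost; a small sketch (the is_perpetual column name is purely illustrative):
s = df_Fix_Date['maturity_date'].astype(str)
df_Fix_Date['is_perpetual'] = s.str.startswith('9999')  # remember which bonds never mature
df_Fix_Date['maturity_date'] = pd.to_datetime(s, errors='coerce')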

Pandas Set on copy warning when using .loc

I'm trying to change the values in a column of a dataframe based on a condition.
In [1]:df.head()
Out[2]: gen cont
timestamp
2012-07-01 00:00:00 0.293 0
2012-07-01 00:30:00 0.315 0
2012-07-01 01:00:00 0.0 0
2012-07-01 01:30:00 0.005 0
2012-07-01 02:00:00 0.231 0
I want to set the 'gen' column to NaN whenever the sum of the 2 columns is below a threshold of 0.01, so what I want is this:
In [1]:df.head()
Out[2]: gen cont
timestamp
2012-07-01 00:00:00 0.293 0
2012-07-01 00:30:00 0.315 0
2012-07-01 01:00:00 NaN 0
2012-07-01 01:30:00 NaN 0
2012-07-01 02:00:00 0.231 0
I have used this:
df.loc[df.gen + df.cont < 0.01, 'gen'] = np.nan
It gives me the result I want but with the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I am confused because I am using .loc and I think I'm using it in the way suggested.
For me your solution works fine.
An alternative solution is mask, which by default sets NaN where the condition is True:
df['gen'] = df['gen'].mask(df['gen'] + df['cont'] < 0.01)
print (df)
timestamp gen cont
0 2012-07-01 00:00:00 0.293 0
1 2012-07-01 00:30:00 0.315 0
2 2012-07-01 01:00:00 NaN 0
3 2012-07-01 01:30:00 NaN 0
4 2012-07-01 02:00:00 0.231 0
EDIT:
You need copy.
If you modify values in df later, you will find that the modifications do not propagate back to the original data (df_in), and that pandas warns about it:
df = df_in.loc[sDate:eDate].copy()
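A minimal sketch of the difference (sDate and eDate stand in for whatever slice bounds are used; numpy is assumed imported as np):
df = df_in.loc[sDate:eDate]           # view-like slice: assigning into it triggers the warning
df = df_in.loc[sDate:eDate].copy()    # independent copy: the assignment is unambiguous
df.loc[df['gen'] + df['cont'] < 0.01, 'gen'] = np.nan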

Addition in SQL by month

I need to SUM column something by month:
date something
2010-01-02
2010-01-03
2010-01-04
2010-01-07
2010-01-10
2010-01-12
2010-01-13
2010-01-14
2010-01-15
2010-01-16
2010-01-17
2010-01-18 3
2010-01-19 1
2010-01-21
2010-01-22 11
2010-01-23 1
2010-01-24
2010-01-25
2010-01-26
2010-01-27
2010-01-28
2010-01-29
2010-01-30
2010-01-05 5
2010-01-06 8
2010-01-09
2010-01-08 3
2010-01-11
2010-01-01
2010-01-20 0
2010-01-31 13
The output should be, e.g. for January 2010, a sum of something of 45:
date something
2010-01 45
How do I write an SQL query for that?
This is a simple aggregation based on the month of the date column:
select to_char("date", 'yyyy-mm'), sum(something)
from the_table
group by to_char("date", 'yyyy-mm')
This assumes the column date has the data type date (or timestamp); to_char is PostgreSQL (and Oracle) syntax.
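Since the rest of this page is pandas-centric, here is the equivalent monthly sum in pandas for comparison (a sketch assuming df has a parsed datetime column 'date' and a numeric column 'something'):
# group by calendar month and sum
monthly = df.groupby(df['date'].dt.to_period('M'))['something'].sum()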