change multiple datetime formats to a single format in a pandas dataframe

I have a DataFrame with multiple formats as shown below
0 07-04-2021
1 06-03-1991
2 12-10-2020
3 07/04/2021
4 05/12/1996
What I want is a single format after applying a pandas function to the entire column, so that all the dates are in the format
date/month/year
What I tried is the following
date1 = pd.to_datetime(df['Date_Reported'], errors='coerce', format='%d/%m/%Y')
But it is not working out. Can this be done? Thank you

Try with dayfirst=True:
date1 = pd.to_datetime(df['Date_Reported'], errors='coerce', dayfirst=True)
output of date1:
0 2021-04-07
1 1991-03-06
2 2020-10-12
3 2021-04-07
4 1996-12-05
Name: Date_Reported, dtype: datetime64[ns]
If needed:
date1=date1.dt.strftime('%d/%m/%Y')
output of date1:
0 07/04/2021
1 06/03/1991
2 12/10/2020
3 07/04/2021
4 05/12/1996
Name: Date_Reported, dtype: object
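Note that newer pandas versions refuse to silently guess when separators are mixed, so dayfirst=True alone may coerce some rows to NaT. A sketch that normalizes the separator first, so one explicit format covers every row (the column name Date_Reported is taken from the question):

```python
import pandas as pd

# Sample reproducing the question's mixed separators
df = pd.DataFrame({'Date_Reported': ['07-04-2021', '06-03-1991', '12-10-2020',
                                     '07/04/2021', '05/12/1996']})

# Normalize '-' to '/' so a single explicit format applies to every row;
# errors='coerce' turns anything unparseable into NaT instead of raising
normalized = df['Date_Reported'].str.replace('-', '/', regex=False)
date1 = pd.to_datetime(normalized, format='%d/%m/%Y', errors='coerce')

# Back to day/month/year strings if a uniform text column is required
df['Date_Reported'] = date1.dt.strftime('%d/%m/%Y')
```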
extracting hour and minutes from a cell in pandas column

How can I split or extract 04:38 from 04:38:00 AM in a pandas dataframe column?
>>> df.timestamp
3 2020-01-17 07:02:20.540540416
2 2020-01-24 01:10:37.837837824
7 2020-03-14 21:58:55.135135232
Name: timestamp, dtype: datetime64[ns]
>>> df.timestamp.dt.strftime('%H:%M')
3 07:02
2 01:10
7 21:58
Name: timestamp, dtype: object
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html?highlight=dt%20strftime#pandas.Series.dt.strftime
Using str.slice:
df["hm"] = df["time"].str.slice(stop=5)
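Putting both approaches together, a runnable sketch assuming the column holds 12-hour strings like the question's 04:38:00 AM:

```python
import pandas as pd

# Hypothetical column of 12-hour time strings like the question's example
df = pd.DataFrame({'time': ['04:38:00 AM', '09:15:30 PM']})

# String route: the first five characters are already HH:MM (12-hour)
df['hm'] = df['time'].str.slice(stop=5)

# Datetime route: parse with %I (12-hour clock) and %p (AM/PM),
# then reformat to a 24-hour HH:MM string
df['hm24'] = pd.to_datetime(df['time'], format='%I:%M:%S %p').dt.strftime('%H:%M')
```

The string route is faster but keeps the 12-hour numbers; the datetime route converts 09:15 PM to 21:15.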

Pandas add row to datetime indexed dataframe

I cannot find a solution for this problem. I would like to add future dates to a datetime indexed Pandas dataframe for model prediction purposes.
Here is where I am right now:
new_datetime = df2.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
And this is where I am stuck. The append examples online always seem to use ignore_index=True, and in my case I want to keep the proper datetime indexing.
Suppose you have this df:
date value
0 2020-01-31 00:00:00 1
1 2020-02-01 00:00:00 2
2 2020-02-02 00:00:00 3
then an alternative for adding future days is
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='D', closed='right')}))
which returns
date value
0 2020-01-31 00:00:00 1.0
1 2020-02-01 00:00:00 2.0
2 2020-02-02 00:00:00 3.0
0 2020-02-03 00:00:00 NaN
1 2020-02-04 00:00:00 NaN
2 2020-02-05 00:00:00 NaN
3 2020-02-06 00:00:00 NaN
4 2020-02-07 00:00:00 NaN
where the frequency is D (days) and the number of periods is 6.
I think I was making this more difficult than necessary because I was using a datetime index instead of the typical integer index. By leaving the 'date' field as a regular column instead of an index, adding the rows is straightforward.
One thing I did do was add a reset_index call so I did not end up with wonky duplicate index values:
df = df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', closed='right')}))
df = df.reset_index(drop=True) # drop=True avoids keeping the old index as a new column
I also needed this, and I solved it by merging the code shared here with the code from this other answer, "add to a dataframe as I go with datetime index", ending up with the following code that works for me.
data=raw.copy()
new_datetime = data.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
today_df = pd.DataFrame({'value': 301.124},index=new_datetime)
data = data.append(today_df)
data.tail()
Here 'value' is the column name from your own dataframe.
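Note that DataFrame.append was removed in pandas 2.0; the same pattern works with pd.concat. A sketch, using a small made-up frame and the made-up value from the answer above:

```python
import pandas as pd

# Hypothetical frame with a datetime index, like the question's df2
data = pd.DataFrame({'value': [1.0, 2.0, 3.0]},
                    index=pd.date_range('2020-01-31', periods=3, freq='D'))

# One day past the current end of the index
new_datetime = data.index[-1:] + pd.Timedelta('1 days')

# pd.concat replaces the removed DataFrame.append
today_df = pd.DataFrame({'value': [301.124]}, index=new_datetime)
data = pd.concat([data, today_df])
```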

pandas to_datetime convert 6PM to 18

Is there a nice way to convert Series data represented like 1PM or 11AM to 13 and 11 respectively, with to_datetime or similar (other than re)?
data:
series
1PM
11AM
2PM
6PM
6AM
desired output:
series
13
11
14
18
6
pd.to_datetime(df['series']) gives the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 11:00:00
You can provide the format you want to use, here %I%p:
pd.to_datetime(df['series'], format='%I%p').dt.hour
The .dt.hour accessor will then obtain the hour for each timestamp. This gives us:
>>> df = pd.DataFrame({'series': ['1PM', '11AM', '2PM', '6PM', '6AM']})
>>> pd.to_datetime(df['series'], format='%I%p').dt.hour
0 13
1 11
2 14
3 18
4 6
Name: series, dtype: int64

get second largest value in row in selected columns in dataframe in pandas

I have a dataframe with subset of it shown below. There are more columns to the right and left of the ones I am showing you
M_cols  10D_MA     30D_MA     50D_MA  100D_MA    200D_MA  Max    Min    2nd smallest
        68.58      70.89      69.37   **68.24**  64.41    70.89  64.41  68.24
        **68.32**  71.00      69.47   68.50      64.49    71.00  64.49  68.32
        68.57      **68.40**  69.57   71.07      64.57    71.07  64.57  68.40
I can get the min (and max is easy as well) with the following code
df2['MIN'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].min(axis=1)
But how do I get the 2nd smallest? I tried this and got the following error
df2['2nd SMALLEST'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].nsmallest(2)
TypeError: nsmallest() missing 1 required positional argument: 'columns'
Seems like this should be a simple answer but I am stuck
For example, suppose you have the following df:
df=pd.DataFrame({'V1':[1,2,3],'V2':[3,2,1],'V3':[3,4,9]})
After picking the values we need to compare, we just need to sort each row's values (np.sort sorts along the last axis by default):
sortdf=pd.DataFrame(np.sort(df[['V1','V2','V3']].values))
sortdf
Out[419]:
0 1 2
0 1 3 3
1 2 2 4
2 1 3 9
1st max:
sortdf.iloc[:,-1]
Out[421]:
0 3
1 4
2 9
Name: 2, dtype: int64
2nd max:
sortdf.iloc[:,-2]
Out[422]:
0 3
1 2
2 3
Name: 1, dtype: int64
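Tying this back to the question: after sorting, each row's second smallest is simply the second column (position 1) of the sorted array. A sketch reusing the answer's example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'V1': [1, 2, 3], 'V2': [3, 2, 1], 'V3': [3, 4, 9]})

# np.sort with its default axis (-1) orders each row ascending,
# so column 1 holds every row's second-smallest value
sortdf = pd.DataFrame(np.sort(df[['V1', 'V2', 'V3']].values))
df['2nd smallest'] = sortdf.iloc[:, 1]
```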

date cannot be serialized

I am getting an error while trying to save the dataframe as a file.
from fastparquet import write
write('profile_dtl.parq', df)
The error is related to "date" and the error message looks like this...
ValueError: Can't infer object conversion type: 0 1990-01-01
1 1954-01-01
2 1981-11-15
3 1993-01-21
4 1948-01-01
5 1977-01-01
6 1968-04-28
7 1969-01-01
8 1989-01-01
9 1985-01-01
Name: dob, dtype: object
I have checked that the column is "object" just like any other column that can be serialized without any problem. If I remove the "dob" column from the dataframe, then this line will work. This will also work if there is date+time.
Are plain dates (without a time component) not accepted by fastparquet?
Try changing dob to datetime64 dtype:
import pandas as pd
dob = pd.Series(['1954-01-01', '1981-11-15', '1993-01-21', '1948-01-01',
'1977-01-01', '1968-04-28', '1969-01-01', '1989-01-01',
'1985-01-01'], name='dob')
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
Name: dob, dtype: object
Note the dtype that results:
pd.to_datetime(dob)
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
dtype: datetime64[ns]
Using this Series as an index in a DataFrame:
baz = list(range(9))
foo = pd.DataFrame(baz, index=pd.to_datetime(dob), columns=['dob'])
You should be able to save your Parquet file now.
from fastparquet import write
write('foo.parquet', foo)
$ ls -l foo.parquet
-rw-r--r-- 1 moi admin 854 Oct 13 16:44 foo.parquet
Your dob Series has an object dtype, and you left the object_encoding='infer' argument to fastparquet.write unchanged. So, from the docs:
"The special value 'infer' will cause the type to be guessed from the first ten non-null values."
Fastparquet does not try to infer a date value from what it expects to be one of bytes|utf8|json|bson|bool|int|float.
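So the simplest fix is to convert the column before writing. A sketch, with the write call left commented out since it assumes fastparquet is installed:

```python
import pandas as pd

# Mirror the question's object-dtype date column (values stored as strings)
df = pd.DataFrame({'dob': ['1990-01-01', '1954-01-01', '1981-11-15']})

# Convert to datetime64 so fastparquet no longer has to guess
# an object encoding for the column
df['dob'] = pd.to_datetime(df['dob'])

# from fastparquet import write
# write('profile_dtl.parq', df)  # now succeeds
```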