Reformatting dataframe column issue [duplicate] - pandas

My dataframe has a DOB column (example format 1/1/2016) which by default gets converted to Pandas dtype 'object'.
Converting this to date format with df['DOB'] = pd.to_datetime(df['DOB']), the date gets converted to: 2016-01-26 and its dtype is: datetime64[ns].
Now I want to convert this date format to 01/26/2016 or any other general date format. How do I do it?
(Whatever the method I try, it always shows the date in 2016-01-26 format.)

You can use dt.strftime if you need to convert datetime to other formats (but note that then dtype of column will be object (string)):
import pandas as pd
df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
print (df)
DOB
0 26/1/2016
1 26/1/2016
df['DOB'] = pd.to_datetime(df.DOB)
print (df)
DOB
0 2016-01-26
1 2016-01-26
df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print (df)
DOB DOB1
0 2016-01-26 01/26/2016
1 2016-01-26 01/26/2016

Changing the format but not changing the type:
df['date'] = pd.to_datetime(df["date"].dt.strftime('%Y-%m'))

There is a difference between
the content of a dataframe cell (a binary value) and
its presentation (displaying it) for us, humans.
So the question is: How to reach the appropriate presentation of my datas without changing the data / data types themselves?
Here is the answer:
If you use the Jupyter notebook for displaying your dataframe, or
if you want to reach a presentation in the form of an HTML file (even with many prepared superfluous id and class attributes for further CSS styling — you may or you may not use them),
use styling. Styling don't change data / data types of columns of your dataframe.
Now I show you how to reach it in the Jupyter notebook — for a presentation in the form of HTML file see the note near the end of this answer.
I will suppose that your column DOB already has the datetime64 type (you have shown that you know how to reach it). I prepared a simple dataframe (with only one column) to show you some basic styling:
Not styled:
df
DOB
0 2019-07-03
1 2019-08-03
2 2019-09-03
3 2019-10-03
Styling it as mm/dd/yyyy:
df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
DOB
0 07/03/2019
1 08/03/2019
2 09/03/2019
3 10/03/2019
Styling it as dd-mm-yyyy:
df.style.format({"DOB": lambda t: t.strftime("%d-%m-%Y")})
DOB
0 03-07-2019
1 03-08-2019
2 03-09-2019
3 03-10-2019
Be careful!
The returning object is NOT a dataframe — it is an object of the class Styler, so don't assign it back to df:
Don't do this:
df = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")}) # Don't do this!
(Every dataframe has its Styler object accessible by its .style property, and we changed this df.style object, not the dataframe itself.)
Questions and Answers:
Q: Why your Styler object (or an expression returning it) used as the last command in a Jupyter notebook cell displays your (styled) table, and not the Styler object itself?
A: Because every Styler object has a callback method ._repr_html_() which returns an HTML code for rendering your dataframe (as a nice HTML table).
Jupyter Notebook IDE calls this method automatically to render objects which have it.
Note:
You don't need the Jupyter notebook for styling (i.e., for nice outputting a dataframe without changing its data / data types).
A Styler object has a method render(), too, if you want to obtain a string with the HTML code (e.g., for publishing your formatted dataframe on the Web, or simply present your table in the HTML format):
df_styler = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
HTML_string = df_styler.render()

Compared to the first answer, I will recommend to use dt.strftime() first, and then pd.to_datetime(). In this way, it will still result in the datetime data type.
For example,
import pandas as pd
df = pd.DataFrame({'DOB': {0: '26/1/2016 ', 1: '26/1/2016 '})
print(df.dtypes)
df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print(df.dtypes)
df['DOB1'] = pd.to_datetime(df['DOB1'])
print(df.dtypes)

The below code worked for me instead of the previous one:
df['DOB']=pd.to_datetime(df['DOB'].astype(str), format='%m/%d/%Y')

You can try this. It'll convert the date format to DD-MM-YYYY:
df['DOB'] = pd.to_datetime(df['DOB'], dayfirst = True)

The below code changes to the 'datetime' type and also formats in the given format string.
df['DOB'] = pd.to_datetime(df['DOB'].dt.strftime('%m/%d/%Y'))

Below is the code that worked for me. And we need to be very careful for format. The below link will be definitely useful for knowing your exiting format and changing into the desired format (follow the strftime() and strptime() format codes in strftime() and strptime() Behavior):
data['date_new_format'] = pd.to_datetime(data['date_to_be_changed'] , format='%b-%y')

Related

To prevent automatic type change in Pandas

I have a excel (.xslx) file with 4 columns:
pmid (int)
gene (string)
disease (string)
label (string)
I attempt to load this directly into python with pandas.read_excel
df = pd.read_excel(path, parse_dates=False)
capture from excel
capture from pandas using my ide debugger
As shown above, pandas tries to be smart, automatically converting some of gene fields such as 3.Oct, 4.Oct to a datetime type. The issue is that 3.Oct or 4.Oct is a abbreviation of Gene type and totally different meaning. so I don't want pandas to do so. How can I prevent pandas from converting types automatically?
Update:
In fact, there is no conversion. The value appears as 2020-10-03 00:00:00 in Pandas because it is the real value stored in the cell. Excel show this value in another format
Update 2:
To keep the same format as Excel, you can use pd.to_datetime and a custom function to reformat the date.
# Sample
>>> df
gene
0 PDGFRA
1 2021-10-03 00:00:00 # Want: 3.Oct
2 2021-10-04 00:00:00 # Want: 4.Oct
>>> df['gene'] = (pd.to_datetime(df['gene'], errors='coerce')
.apply(lambda dt: f"{dt.day}.{calendar.month_abbr[dt.month]}"
if dt is not pd.NaT else np.NaN)
.fillna(df['gene']))
>>> df
gene
0 PDGFRA
1 3.Oct
2 4.Oct
Old answer
Force dtype=str to prevent Pandas try to transform your dataframe
df = pd.read_excel(path, dtype=str)
Or use converters={'colX': str, ...} to map the dtype for each columns.
pd.read_excel has a dtype argument you can use to specify data types explicitly.

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with a column of dates, unfortunately my import (using read_excel) brought in format of dates as datetime and also excel dates as integers.
What I am seeking is a column with dates only in format %Y-%m-%d
From research, excel starts at 1900-01-00, so I could add these integers. I have tried to use str.extract and a regex in order to separate the columns into two, one of datetimes, the other as integers. However the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
Attempt to first separate the columns by extracting the integers( dates imported from MS excel)
df.date_from.str.extract(r'(\d\d\d\d\d)')
however this gives NaN.
The reason I have tried to separate integers out of the column, is that I get an error when trying to act on the excel dates within the mixed column (in other words and error using the following code:)
def convert_excel_time(excel_time):
return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time,'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
You can convert values to timedeltas by to_timedelta with errors='coerce' for NaT if not integers add Timestamp called d, then convert datetimes with errors='coerce' and last pass to Series.fillna in custom function:
def f(x):
#https://stackoverflow.com/a/9574948/2901002
d = pd.Timestamp(1899, 12, 30)
timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
dates = pd.to_datetime(x, errors='coerce')
return (timedeltas + d).fillna(dates)
cols = ['date_from','date_to']
df[cols] = df[cols].apply(f)
print (df)
date_from date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16

Converting Date Time index in Pandas [duplicate]

My dataframe has a DOB column (example format 1/1/2016) which by default gets converted to Pandas dtype 'object'.
Converting this to date format with df['DOB'] = pd.to_datetime(df['DOB']), the date gets converted to: 2016-01-26 and its dtype is: datetime64[ns].
Now I want to convert this date format to 01/26/2016 or any other general date format. How do I do it?
(Whatever the method I try, it always shows the date in 2016-01-26 format.)
You can use dt.strftime if you need to convert datetime to other formats (but note that then dtype of column will be object (string)):
import pandas as pd
df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
print (df)
DOB
0 26/1/2016
1 26/1/2016
df['DOB'] = pd.to_datetime(df.DOB)
print (df)
DOB
0 2016-01-26
1 2016-01-26
df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print (df)
DOB DOB1
0 2016-01-26 01/26/2016
1 2016-01-26 01/26/2016
Changing the format but not changing the type:
df['date'] = pd.to_datetime(df["date"].dt.strftime('%Y-%m'))
There is a difference between
the content of a dataframe cell (a binary value) and
its presentation (displaying it) for us, humans.
So the question is: How to reach the appropriate presentation of my datas without changing the data / data types themselves?
Here is the answer:
If you use the Jupyter notebook for displaying your dataframe, or
if you want to reach a presentation in the form of an HTML file (even with many prepared superfluous id and class attributes for further CSS styling — you may or you may not use them),
use styling. Styling don't change data / data types of columns of your dataframe.
Now I show you how to reach it in the Jupyter notebook — for a presentation in the form of HTML file see the note near the end of this answer.
I will suppose that your column DOB already has the datetime64 type (you have shown that you know how to reach it). I prepared a simple dataframe (with only one column) to show you some basic styling:
Not styled:
df
DOB
0 2019-07-03
1 2019-08-03
2 2019-09-03
3 2019-10-03
Styling it as mm/dd/yyyy:
df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
DOB
0 07/03/2019
1 08/03/2019
2 09/03/2019
3 10/03/2019
Styling it as dd-mm-yyyy:
df.style.format({"DOB": lambda t: t.strftime("%d-%m-%Y")})
DOB
0 03-07-2019
1 03-08-2019
2 03-09-2019
3 03-10-2019
Be careful!
The returning object is NOT a dataframe — it is an object of the class Styler, so don't assign it back to df:
Don't do this:
df = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")}) # Don't do this!
(Every dataframe has its Styler object accessible by its .style property, and we changed this df.style object, not the dataframe itself.)
Questions and Answers:
Q: Why your Styler object (or an expression returning it) used as the last command in a Jupyter notebook cell displays your (styled) table, and not the Styler object itself?
A: Because every Styler object has a callback method ._repr_html_() which returns an HTML code for rendering your dataframe (as a nice HTML table).
Jupyter Notebook IDE calls this method automatically to render objects which have it.
Note:
You don't need the Jupyter notebook for styling (i.e., for nice outputting a dataframe without changing its data / data types).
A Styler object has a method render(), too, if you want to obtain a string with the HTML code (e.g., for publishing your formatted dataframe on the Web, or simply present your table in the HTML format):
df_styler = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
HTML_string = df_styler.render()
Compared to the first answer, I will recommend to use dt.strftime() first, and then pd.to_datetime(). In this way, it will still result in the datetime data type.
For example,
import pandas as pd
df = pd.DataFrame({'DOB': {0: '26/1/2016 ', 1: '26/1/2016 '})
print(df.dtypes)
df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print(df.dtypes)
df['DOB1'] = pd.to_datetime(df['DOB1'])
print(df.dtypes)
The below code worked for me instead of the previous one:
df['DOB']=pd.to_datetime(df['DOB'].astype(str), format='%m/%d/%Y')
You can try this. It'll convert the date format to DD-MM-YYYY:
df['DOB'] = pd.to_datetime(df['DOB'], dayfirst = True)
The below code changes to the 'datetime' type and also formats in the given format string.
df['DOB'] = pd.to_datetime(df['DOB'].dt.strftime('%m/%d/%Y'))
Below is the code that worked for me. And we need to be very careful for format. The below link will be definitely useful for knowing your exiting format and changing into the desired format (follow the strftime() and strptime() format codes in strftime() and strptime() Behavior):
data['date_new_format'] = pd.to_datetime(data['date_to_be_changed'] , format='%b-%y')

pandas to_datetime leaves unconverted data

I'm trying to convert a column with strings that looks like "201905011" (year/month/day) to datetime, ideally showing as 05-01-2019 (month/day/year). I'm currently trying to following but it's not working for me.
pd.to_datetime(data.datetime, format = '%Y%m%d%H')
This leaves me with the error: "ValueError: unconverted data remains: 4"
I would like to know instead how to correctly do this.
I created an example based on ALollz comment. I have created a dataframe in which first row is correct and second row has and extra 0 in the end. If you use the method, it will return the rows in which the data doesn't match the specified format.
import pandas as pd
df = pd.DataFrame({"datefield":["201901010","20190101010"]})
df.loc[pd.to_datetime(df.datefield, format='%Y%m%d%H', errors='coerce').isnull(), 'datefield']
1 20190101010
Name: datefield, dtype: object

Datetime column coerced to int when setting with .loc and slice

I have a column of datetimes and need to change several of these values to new datetimes. When I set the values using df.loc[indices, 'col'] = new_datetimes, the unaffected values are coerced to int while the new set values are in datetime. If I set the values one at a time, no type coercion occurs.
For illustration I created a sample df with just one column.
df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[[1,3,4]] = [dt.datetime(2019,1,2)]*3
df
This produces the following:
output
If I change indices 1,3,4 individually:
df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[1] = dt.datetime(2019,1,2)
df.loc[3] = dt.datetime(2019,1,2)
df.loc[4] = dt.datetime(2019,1,2)
df
I get the correct output:
output
A suggestion was to turn the list to a numpy array before setting, which does resolve the issue. However, if you try to set multiple columns (some of which are not datetime) using a numpy array, The issue arises again.
In this example the dataframe has two columns and I try to set both columns.
df = pd.DataFrame({'dt':[dt.datetime(2019,1,1)]*5, 'value':[1,1,1,1,1]})
df.loc[[1,3,4]] = np.array([[dt.datetime(2019,1,2)]*3, [2,2,2]]).T
df
This gives the following output:
output
Can someone please explain what is causing the coercion and how to prevent it from doing so? The code I wrote that uses this was written over a month ago and used to work just fine, could it be one of those warnings about future version of pandas deprecating certain functionalities?
An explanation of what is going on would be greatly appreciated because I wrote a other codes that likely employ similar functionality want to make sure everything works as intended.
The solution proposed by w-m has such an "awkward detail" than
the result column has also the time part (it didn't have it
before).
I have also such a remark, that DataFrames are tables not Series,
so they have columns, each with its name and it is a bad habit to
rely on default column names (consecutive numbers).
So I propose another solution, addressing both above issues:
To create the source DataFrame I executed:
df = pd.DataFrame([dt.datetime(2019, 1, 1)]*5, columns=['c1'])
Note that I provided a name for the only column.
Then I created another DataFrame:
df2 = pd.DataFrame([dt.datetime(2019,1,2)]*3, columns=['c1'], index=[1,3,4])
It contains your "new" dates and the numbers which you used in loc
I set as the index (again with the same column name).
Then, to update df, use (not surprisingly) df.update:
df.update(df2)
This function performs in-place update, so if you print(df), you will get:
c1
0 2019-01-01
1 2019-01-02
2 2019-01-01
3 2019-01-02
4 2019-01-02
As you can see, under indices 1, 3 and 4 you have new dates
and there is no time part, just like before.
[dt.datetime(2019,1,2)]*3 is a Python list of objects. This particular list happens to contain only datetimes, but Pandas does not seem to recognize that, and treats it as it is - a list of any kind of objects.
If you convert it into a typed array, then Pandas will keep the original dtype of the column intact:
df.loc[[1,3,4]] = np.asarray([dt.datetime(2019,1,2)]*3)
I hope this workaround helps you, but you may still want to file a bug with Pandas. I don't have an explanation as to why the datetime objects should be coerced to ints in the first output example.