Pandas dataframe datetime timestamp from string - pandas

I am trying to convert a column in a pandas dataframe from a string to a timestamp.
Due to a slightly annoying constraint (I am limited by my employers software & IT policy) I am running an older version of Pandas (0.14.1). This version does include the "pd.Timestamp".
Essentially, I want to pass a dataframe column formatted as a string to "pd.Timestamp" to create a column of Timestamps. Here is an example dataframe
'Date/Time String' 'timestamp'
0 2017-01-01 01:02:03 NaN
1 2017-01-02 04:05:06 NaN
2 2017-01-03 07:08:09 NaN
My DataFrame is very big, so iterating through it is really inefficient. But this is what I came up with:
for i in range (len(df['Date/Time String'])):
df['timestamp'].iloc[i] = pd.Timestamp(df['Date/Time String'].iloc[i])
What would be the sensible way to make this operation much faster?

You can check this:
import pandas as pd
df['Date/Time Timestamp'] = pd.to_datetime(df['Date/Time String'])

Related

pd.read_csv - dates in pandas multiindex column names

I import a csv file into a pandas dataframe.
df=pd.read_csv('data.csv',index_col=[0],header=[0,1])
My data has a column multiindex with two levels. Level(0) contains strings and level(1) contains dates.
By default, these dates become strings when imported.
I would like to convert level(1) column names to date format either when I import the data (but I cannot figure out the right way to do that when reading the documentation) or subsequently, if not possible during the import phase.
However, if I do:
cdq.columns.levels[1]=cdq.columns.levels[1].astype('datetime64[ns]')
I get the message that 'FrozenList' does not support mutable operations.
Is there a way to do that?
d = {'Ticker':['ABBN.SW','ABBN.SW','ABBN.SW','ABBN.SW'],
'Date':['31/12/2021 00:00','30/09/2021 00:00','30/06/2021 00:00','31/03/2021 00:00'],
'investments':[-480000000,251000000,892000000,162000000],
'changeToLiabilities':[298000000,52000000,267000000,42000000]}
pd.DataFrame(d)
Ticker Date investments changeToLiabilities
0 ABBN.SW 31/12/2021 00:00 -480000000 298000000
1 ABBN.SW 30/09/2021 00:00 251000000 52000000
2 ABBN.SW 30/06/2021 00:00 892000000 267000000
3 ABBN.SW 31/03/2021 00:00 162000000 42000000

To prevent automatic type change in Pandas

I have a excel (.xslx) file with 4 columns:
pmid (int)
gene (string)
disease (string)
label (string)
I attempt to load this directly into python with pandas.read_excel
df = pd.read_excel(path, parse_dates=False)
capture from excel
capture from pandas using my ide debugger
As shown above, pandas tries to be smart, automatically converting some of gene fields such as 3.Oct, 4.Oct to a datetime type. The issue is that 3.Oct or 4.Oct is a abbreviation of Gene type and totally different meaning. so I don't want pandas to do so. How can I prevent pandas from converting types automatically?
Update:
In fact, there is no conversion. The value appears as 2020-10-03 00:00:00 in Pandas because it is the real value stored in the cell. Excel show this value in another format
Update 2:
To keep the same format as Excel, you can use pd.to_datetime and a custom function to reformat the date.
# Sample
>>> df
gene
0 PDGFRA
1 2021-10-03 00:00:00 # Want: 3.Oct
2 2021-10-04 00:00:00 # Want: 4.Oct
>>> df['gene'] = (pd.to_datetime(df['gene'], errors='coerce')
.apply(lambda dt: f"{dt.day}.{calendar.month_abbr[dt.month]}"
if dt is not pd.NaT else np.NaN)
.fillna(df['gene']))
>>> df
gene
0 PDGFRA
1 3.Oct
2 4.Oct
Old answer
Force dtype=str to prevent Pandas try to transform your dataframe
df = pd.read_excel(path, dtype=str)
Or use converters={'colX': str, ...} to map the dtype for each columns.
pd.read_excel has a dtype argument you can use to specify data types explicitly.

Change NaN to None in Pandas dataframe

I try to replace Nan to None in pandas dataframe. It was working to use df.where(df.notnull(),None).
Here is the thread for this method.
Use None instead of np.nan for null values in pandas DataFrame
When I try to use the same method on another dataframe, it failed.
The new dataframe is like below
A NaN B C D E, the print out of the dataframe is like this:
Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 A NaN B C D E
even when I use the working code run against the new dataframe, it failed.
I just wondering is it is because in the excel, the cell format has to be certain type.
Any suggestion on this?
This always works for me
df = df.replace({np.nan:None})
You can check this related question, Credit from here
The problem is that I did not follow the format.
The format I used that cause the problem was
df.where(df.notnull(), None)
If I wrote the code like this, there is no problem
df = df.where(df.notnull(), None)
To do it just over one column
df.col_name.replace({np.nan: None}, inplace=True)
This is not as easy as it looks.
1.NaN is the value set for any cell that is empty when we are reading file using pandas.read_csv()
2.None is the value set for any cell that is NULL when we are reading file using pandas.read_sql() or readin from a database
import pandas as pd
import numpy as np
x=pd.DataFrame()
df=pd.read_csv('file.csv')
df=df.replace({np.NaN:None})
df['prog']=df['prog'].astype(str)
print(df)
if there is compatibility issue of datatype , which will be because on replacing np.NaN will make the column of dataframe as object type.
so in this case first replace np.NaN with None and then choose the required datatype for the column
file.csv
column names : batch,prog,name
'prog' column is empty

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with a column of dates, unfortunately my import (using read_excel) brought in format of dates as datetime and also excel dates as integers.
What I am seeking is a column with dates only in format %Y-%m-%d
From research, excel starts at 1900-01-00, so I could add these integers. I have tried to use str.extract and a regex in order to separate the columns into two, one of datetimes, the other as integers. However the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
Attempt to first separate the columns by extracting the integers( dates imported from MS excel)
df.date_from.str.extract(r'(\d\d\d\d\d)')
however this gives NaN.
The reason I have tried to separate integers out of the column, is that I get an error when trying to act on the excel dates within the mixed column (in other words and error using the following code:)
def convert_excel_time(excel_time):
return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time,'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
You can convert values to timedeltas by to_timedelta with errors='coerce' for NaT if not integers add Timestamp called d, then convert datetimes with errors='coerce' and last pass to Series.fillna in custom function:
def f(x):
#https://stackoverflow.com/a/9574948/2901002
d = pd.Timestamp(1899, 12, 30)
timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
dates = pd.to_datetime(x, errors='coerce')
return (timedeltas + d).fillna(dates)
cols = ['date_from','date_to']
df[cols] = df[cols].apply(f)
print (df)
date_from date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16

subset a data frame based on date range [duplicate]

I have a Pandas DataFrame with a 'date' column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.
What is the best way to achieve this?
If date column is the index, then use .loc for label based indexing or .iloc for positional indexing.
For example:
df.loc['2014-01-01':'2014-02-01']
See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection
If the column is not the index you have two choices:
Make it the index (either temporarily or permanently if it's time-series data)
df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]
See here for the general explanation
Note: .ix is deprecated.
Previous answer is not correct in my experience, you can't pass it a simple string, needs to be a datetime object. So:
import datetime
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]
And if your dates are standardized by importing datetime package, you can simply use:
df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]
For standarding your date string using datetime package, you can use this function:
import datetime
datetime.datetime.strptime
If you have already converted the string to a date format using pd.to_datetime you can just use:
df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]
The shortest way to filter your dataframe by date:
Lets suppose your date column is type of datetime64[ns]
# filter by single day
df_filtered = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']
# filter by single month
df_filtered = df[df['date'].dt.strftime('%Y-%m') == '2014-01']
# filter by single year
df_filtered = df[df['date'].dt.strftime('%Y') == '2014']
If your datetime column have the Pandas datetime type (e.g. datetime64[ns]), for proper filtering you need the pd.Timestamp object, for example:
from datetime import date
import pandas as pd
value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]
If the dates are in the index then simply:
df['20160101':'20160301']
You can use pd.Timestamp to perform a query and a local reference
import pandas as pd
import numpy as np
df = pd.DataFrame()
ts = pd.Timestamp
df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')
print(df)
print(df.query('date > #ts("20190515T071320")')
with the output
date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
Have a look at the pandas documentation for DataFrame.query, specifically the mention about the local variabile referenced udsing # prefix. In this case we reference pd.Timestamp using the local alias ts to be able to supply a timestamp string
So when loading the csv data file, we'll need to set the date column as index now as below, in order to filter data based on a range of dates. This was not needed for the now deprecated method: pd.DataFrame.from_csv().
If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:
import pandas as pd
mydata = pd.read_csv('mydata.csv',index_col='date') # or its index number, e.g. index_col=[0]
mydata['2020-01-01':'2020-02-29'] # will pull all the columns
#if just need one column, e.g. Cost, can be done:
mydata['2020-01-01':'2020-02-29','Cost']
This has been tested working for Python 3.7. Hope you will find this useful.
I'm not allowed to write any comments yet, so I'll write an answer, if somebody will read all of them and reach this one.
If the index of the dataset is a datetime and you want to filter that just by (for example) months, you can do following:
df.loc[df.index.month == 3]
That will filter the dataset for you by March.
How about using pyjanitor
It has cool features.
After pip install pyjanitor
import janitor
df_filtered = df.filter_date(your_date_column_name, start_date, end_date)
You could just select the time range by doing: df.loc['start_date':'end_date']
In pandas version 1.1.3 I encountered a situation where the python datetime based index was in descending order. In this case
df.loc['2021-08-01':'2021-08-31']
returned empty. Whereas
df.loc['2021-08-31':'2021-08-01']
returned the expected data.
Another solution if you would like to use the .query() method.
It allows you to use write readable code like .query(f"{start} < MyDate < {end}") on the trade off, that .query() parses strings and the columns values must be in pandas date format (so that it is also understandable for .query())
df = pd.DataFrame({
'MyValue': [1,2,3],
'MyDate': pd.to_datetime(['2021-01-01','2021-01-02','2021-01-03'])
})
start = datetime.date(2021,1,1).strftime('%Y%m%d')
end = datetime.date(2021,1,3).strftime('%Y%m%d')
df.query(f"{start} < MyDate < {end}")
(following the comment from #Phillip Cloud, answer from #Retozi)
import the pandas library
import pandas as pd
STEP 1: convert the date column into a string using the pd.to_datetime() method
df['date']=pd.to_datetime(df["date"],unit='s')
STEP 2: perform the filtering in any predetermined manner ( i.e 2 months)
df = df[(df["date"] >"2022-03-01" & df["date"] < "2022-05-03")]
STEP 3 : Check the output
print(df)
# 60 days from today
after_60d = pd.to_datetime('today').date() + datetime.timedelta(days=60)
# filter date col less than 60 days date
df[df['date_col'] < after_60d]