Extracting hour and minutes from a cell in a pandas column

How can I split or extract 04:38 from 04:38:00 AM in a pandas dataframe column?

>>> df.timestamp
3 2020-01-17 07:02:20.540540416
2 2020-01-24 01:10:37.837837824
7 2020-03-14 21:58:55.135135232
Name: timestamp, dtype: datetime64[ns]
>>> df.timestamp.dt.strftime('%H:%M')
3 07:02
2 01:10
7 21:58
Name: timestamp, dtype: object
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html?highlight=dt%20strftime#pandas.Series.dt.strftime

Using str.slice:
df["hm"] = df["time"].str.slice(stop=5)
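Both approaches can be sketched end to end. A minimal sketch, assuming a hypothetical column named `time` holding strings like `"04:38:00 AM"`: `str.slice(stop=5)` keeps the first five characters as-is, while parsing with `pd.to_datetime` plus `dt.strftime` converts to a 24-hour clock.

```python
import pandas as pd

# Hypothetical sample data: time stored as "HH:MM:SS AM/PM" strings
df = pd.DataFrame({"time": ["04:38:00 AM", "09:15:30 PM"]})

# Option 1: pure string slicing -- keeps the first five characters
df["hm_slice"] = df["time"].str.slice(stop=5)

# Option 2: parse to datetime, then format back to "HH:MM"
# (%I = 12-hour clock, %p = AM/PM marker)
df["hm_fmt"] = pd.to_datetime(df["time"], format="%I:%M:%S %p").dt.strftime("%H:%M")

print(df)
```

Note the difference for the PM row: slicing keeps the 12-hour digits and silently drops the AM/PM marker (`"09:15"`), whereas parsing yields the 24-hour time (`"21:15"`).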

Related

series from dictionary using pandas

#create series from dictionary using pandas
data_dict={'Ahmed':90,'Ali':85,'Omar':80}
series=pd.Series(data_dict,index=['Ahmed','Ali','Omar'])
print("Series :",series)
series2=pd.Series(data_dict,index=['Ahmed','Ali','Omar','Karthi'])
print("Series 2 :",series2)
I tried this code while practicing pandas and received the following output:
Series :
Ahmed 90
Ali 85
Omar 80
dtype: int64
Series 2 :
Ahmed 90.0
Ali 85.0
Omar 80.0
Karthi NaN
dtype: float64
Question: Why did the data type change from int to float in Series 2?
I just wanted to see what the output would be if I added an extra entry to the index that does not belong to the dictionary. I got NaN, but the dtype changed from int to float.
When providing a dictionary to pandas.Series, the keys are used as index, and the values as data.
In fact you only need:
series = pd.Series(data_dict)
print(series)
Ahmed 90
Ali 85
Omar 80
dtype: int64
If you use a list as source of the data, then index is useful:
series = pd.Series([90, 85, 80], index=['Ahmed','Ali','Omar'])
print(series)
Ahmed 90
Ali 85
Omar 80
dtype: int64
When you provide both, this acts as a reindex:
series = pd.Series(data_dict, index=['Ahmed','Ali','Omar','Karthi'])
# equivalent to
series = pd.Series(data_dict).reindex(['Ahmed','Ali','Omar','Karthi'])
print(series)
Ahmed 90.0
Ali 85.0
Omar 80.0
Karthi NaN
dtype: float64
In this case, missing indices are filled with NaN by default, which forces the float64 dtype.
You can prevent the change by using the Int64 dtype that supports an integer NA:
series = pd.Series(data_dict, index=['Ahmed','Ali','Omar','Karthi'], dtype='Int64')
print(series)
Output:
Ahmed 90
Ali 85
Omar 80
Karthi <NA>
dtype: Int64
NaN is a special floating-point value (IEEE 754). There is no value for Karthi in series2, so it gets automatically filled in with NaN. Try converting one of the integers to np.NaN and you will see the same behavior: a Series that contains a floating-point value is automatically cast to a floating-point dtype.
import pandas as pd
import numpy as np
data_dict = {'Ahmed':90, 'Ali':85, 'Omar':np.NaN}
series = pd.Series(data_dict, index=['Ahmed','Ali','Omar'])
print(series)
Output:
Ahmed 90.0
Ali 85.0
Omar NaN
dtype: float64

Python: Mixed date format in data frame column

I have a dataframe with mixed date formats across and within columns. When trying to convert them from object to datetime type, I get an error due to column date1 having a mixed format. I can't see how to fix it in this case. Also, how could I remove the seconds from both columns (date1 and date2)?
Here's the code I attempted:
df = pd.DataFrame(np.array([[10, "2021-06-13 12:08:52.311 UTC", "2021-03-29 12:44:33.468"],
[36, "2019-12-07 12:18:02 UTC", "2011-10-15 10:14:32.118"]
]),
columns=['col1', 'date1', 'date2'])
df
>>
col1 date1 date2
0 10 2021-06-13 12:08:52.311 UTC 2021-03-29 12:44:33.468
1 36 2019-12-07 12:18:02 UTC 2011-10-15 10:14:32.118
# Converting from object to datetime
df["date1"]= pd.to_datetime(df["date1"], format="%Y-%m-%d %H:%M:%S.%f UTC")
df["date2"]= pd.to_datetime(df["date2"], format="%Y-%m-%d %H:%M:%S.%f")
>>
ValueError: time data '2019-12-07 12:18:02 UTC' does not match format '%Y-%m-%d %H:%M:%S.%f UTC' (match)
For the conversion to datetime, I found infer_datetime_format helpful.
I could not get it to work on the complete DataFrame, but it can convert one column at a time.
In [19]: pd.to_datetime(df["date1"], infer_datetime_format=True)
Out[19]:
0 2021-06-13 12:08:52.311000+00:00
1 2019-12-07 12:18:02+00:00
Name: date1, dtype: datetime64[ns, UTC]
In [20]: pd.to_datetime(df["date2"], infer_datetime_format=True)
Out[20]:
0 2021-03-29 12:44:33.468
1 2011-10-15 10:14:32.118
Name: date2, dtype: datetime64[ns]
If all the formats at least start with the prefix "%Y-%m-%d %H:%M", you can slice every string up to that point and use the result:
In [32]: df['date1'].str.slice(stop=16)
Out[32]:
0 2021-06-13 12:08
1 2019-12-07 12:18
Name: date1, dtype: object
To get rid of the seconds in your datetime values, instead of simply truncating them you can use round; also check floor and ceil, whichever suits your use case better.
In [28]: pd.to_datetime(df["date1"], infer_datetime_format=True).dt.round('T')
Out[28]:
0 2021-06-13 12:09:00+00:00
1 2019-12-07 12:18:00+00:00
Name: date1, dtype: datetime64[ns, UTC]
In [29]: pd.to_datetime(df["date2"], infer_datetime_format=True).dt.round('T')
Out[29]:
0 2021-03-29 12:45:00
1 2011-10-15 10:15:00
Name: date2, dtype: datetime64[ns]
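The slicing and parsing steps above can be combined into one pass. A minimal, self-contained sketch using the two-row frame from the question: slice every string down to the shared "%Y-%m-%d %H:%M" prefix, then parse with a single explicit format (note this truncates the seconds rather than rounding them, unlike dt.round).

```python
import pandas as pd

# Sample data from the question: mixed formats within and across columns
df = pd.DataFrame({
    "date1": ["2021-06-13 12:08:52.311 UTC", "2019-12-07 12:18:02 UTC"],
    "date2": ["2021-03-29 12:44:33.468", "2011-10-15 10:14:32.118"],
})

# "%Y-%m-%d %H:%M" is exactly 16 characters, so slice to that prefix
# and parse both columns with one explicit format
for col in ["date1", "date2"]:
    df[col] = pd.to_datetime(df[col].str.slice(stop=16), format="%Y-%m-%d %H:%M")

print(df.dtypes)
```

Both columns end up as plain datetime64[ns], minutes intact and seconds gone.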

change multiple date time formats to single format in pandas dataframe

I have a DataFrame with multiple formats as shown below
0 07-04-2021
1 06-03-1991
2 12-10-2020
3 07/04/2021
4 05/12/1996
What I want is one format after applying the pandas function to the entire column, so that all the dates are in the format
day/month/year
What I tried is the following
date1 = pd.to_datetime(df['Date_Reported'], errors='coerce', format='%d/%m/%Y')
But it is not working out. Can this be done? Thank you
try with dayfirst=True:
date1=pd.to_datetime(df['Date_Reported'], errors='coerce',dayfirst=True)
output of date1:
0 2021-04-07
1 1991-03-06
2 2020-10-12
3 2021-04-07
4 1996-12-05
Name: Date_Reported, dtype: datetime64[ns]
If needed:
date1=date1.dt.strftime('%d/%m/%Y')
output of date1:
0 07/04/2021
1 06/03/1991
2 12/10/2020
3 07/04/2021
4 05/12/1996
Name: Date_Reported, dtype: object
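The answer's snippet assumes df already exists. Since the separators differ but the field order is the same in every row, another option (a sketch that avoids relying on format inference) is to normalize the separators first and then parse with one explicit format, using the sample dates from the question:

```python
import pandas as pd

# Sample day-first dates from the question, with mixed "-" and "/" separators
df = pd.DataFrame({"Date_Reported": ["07-04-2021", "06-03-1991", "12-10-2020",
                                     "07/04/2021", "05/12/1996"]})

# Normalize every separator to "-", then parse with a single explicit format
normalized = df["Date_Reported"].str.replace("/", "-", regex=False)
date1 = pd.to_datetime(normalized, format="%d-%m-%Y", errors="coerce")

# Back to a uniform day/month/year string representation
date1 = date1.dt.strftime("%d/%m/%Y")
print(date1)
```

With an explicit format there is no ambiguity about day-first ordering, so dayfirst is not needed.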

How to change datetime to numeric discarding 0s at end [duplicate]

I have a DataFrame in pandas called munged_data with two columns, entry_date and dob, which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate ages based on the time difference between entry_date and dob, and for that I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I cannot seem to find a way to do this with a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final probably next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (i.e. nothing like how we use Timestamp now for datetime64[ns]; that is coming in 0.12)
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.iloc[0]-df.iloc[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.iloc[0]-df.iloc[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494
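Tying the answers back to the original age calculation: in any reasonably recent pandas, the .dt.days accessor on the timedelta Series makes the whole thing a vectorized one-liner. A minimal sketch with made-up entry_date and dob values (reusing the dates from the 0.11-era answer above):

```python
import pandas as pd

# Hypothetical stand-in for munged_data from the question
munged_data = pd.DataFrame({
    "entry_date": pd.to_datetime(["2013-04-19", "2013-04-19"]),
    "dob": pd.to_datetime(["2001-01-01", "2004-06-01"]),
})

# Subtracting two datetime64[ns] columns yields a timedelta64[ns] Series;
# .dt.days extracts the whole-day component as int64
days = (munged_data["entry_date"] - munged_data["dob"]).dt.days
years = (days / 365.25).round(2)
print(years)
```

No apply is needed; the subtraction and the .dt.days extraction are both vectorized.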

date can not be serialized

I am getting an error while trying to save the dataframe as a file.
from fastparquet import write
write('profile_dtl.parq', df)
The error is related to "date" and the error message looks like this...
ValueError: Can't infer object conversion type: 0 1990-01-01
1 1954-01-01
2 1981-11-15
3 1993-01-21
4 1948-01-01
5 1977-01-01
6 1968-04-28
7 1969-01-01
8 1989-01-01
9 1985-01-01
Name: dob, dtype: object
I have checked that the column is "object", just like other columns that can be serialized without any problem. If I remove the "dob" column from the dataframe, this line works. It also works if the values contain date+time.
Are plain dates not accepted by fastparquet?
Try changing dob to datetime64 dtype:
import pandas as pd
dob = pd.Series(['1954-01-01', '1981-11-15', '1993-01-21', '1948-01-01',
'1977-01-01', '1968-04-28', '1969-01-01', '1989-01-01',
'1985-01-01'], name='dob')
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
Name: dob, dtype: object
Note the dtype that results:
pd.to_datetime(dob)
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
dtype: datetime64[ns]
Using this Series as an index in a DataFrame:
baz = list(range(9))
foo = pd.DataFrame(baz, index=pd.to_datetime(dob), columns=['dob'])
You should be able to save your Parquet file now.
from fastparquet import write
write('foo.parquet', foo)
$ls -l foo.parquet
-rw-r--r-- 1 moi admin 854 Oct 13 16:44 foo.parquet
Your dob Series has object dtype, and you left the object_encoding='infer' argument to fastparquet.write unchanged. So, from the docs:
"The special value 'infer' will cause the type to be guessed from the first ten non-null values."
Fastparquet does not try to infer a date value; it only expects one of bytes|utf8|json|bson|bool|int|float.
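In other words, the fix is to give the column a concrete datetime64 dtype before writing. A minimal sketch of that fix, using a few of the dob values from the question (the fastparquet call itself is left commented since it requires the library to be installed):

```python
import pandas as pd

# dob arrives as plain strings, so the column dtype is object
df = pd.DataFrame({"dob": ["1990-01-01", "1954-01-01", "1981-11-15"]})
assert df["dob"].dtype == object

# Converting to datetime64[ns] gives fastparquet a concrete type to serialize
df["dob"] = pd.to_datetime(df["dob"])
print(df["dob"].dtype)  # datetime64[ns]

# With the dtype fixed, the original call should work:
# from fastparquet import write
# write('profile_dtl.parq', df)
```

After the conversion, object_encoding='infer' never sees the dob column at all, because it is no longer an object column.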