Why isn't a list of pd.Interval automatically recognized by DataFrame? - pandas

import pandas as pd
intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
pd.DataFrame({'d':intervals}).dtypes
This produces an object dtype, not an interval dtype:
>>> d object
>>> dtype: object
But at the same time a list of, for example, Timestamps is recognized on the fly:
datetimes = [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]
pd.DataFrame({'d':datetimes}).dtypes
>>> d datetime64[ns]
>>> dtype: object
Is the situation with intervals like that with a list of strings, where the default column dtype is also object because the DataFrame doesn't 'know' whether we want to treat the column as raw objects (for dumping to disk, ...), as strings (for concatenation, ...), or even as elements of a category type? If so, what would the different use cases for intervals be? If not, what is going on here?

This is a bug in pandas: https://github.com/pandas-dev/pandas/issues/23563
For now, the cleanest workaround is to wrap the list with pd.array:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.2'
In [2]: intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
In [3]: pd.DataFrame({'d': pd.array(intervals)}).dtypes
Out[3]:
d interval[float64]
dtype: object

Related

Pandas: Series and DataFrames handling of datetime objects

Pandas behaves in an unusual way when interacting with datetime information, depending on whether the data is a Series or not. Specifically, either .dt is required (if it's a Series) or .dt will throw an error (if it's not a Series). I've spent the better part of an hour tracking the behavior down.
import pandas as pd
data = {'dates':['2019-03-01','2019-03-02'],'event':[0,1]}
df = pd.DataFrame(data)
df['dates'] = pd.to_datetime(df['dates'])
Pandas Series:
df['dates'][0:1].dt.year
>>>
0 2019
Name: dates, dtype: int64
df['dates'][0:1].year
>>>
AttributeError: 'Series' object has no attribute 'year'
Not Pandas Series:
df['dates'][0].year
>>>
2019
df['dates'][0].dt.year
>>>
AttributeError: 'Timestamp' object has no attribute 'dt'
Does anyone know why pandas behaves this way? Is this a "feature, not a bug", i.e. is it actually useful in some setting?
This behaviour is consistent with Python: a collection of datetimes is fundamentally different from a single datetime.
We can see this simply with list vs datetime object:
from datetime import datetime
a = datetime.now()
print(a.year)
# 2021
list_of_datetimes = [datetime.now(), datetime.now()]
print(list_of_datetimes.year)
# AttributeError: 'list' object has no attribute 'year'
Naturally a list does not have a year attribute, because in python we cannot guarantee the list contains only datetimes.
We would have to apply some function to each element in the list to access the year, for example:
from datetime import datetime
list_of_datetimes = [datetime.now(), datetime.now()]
print(*map(lambda d: d.year, list_of_datetimes))
# 2021 2021
This concept of "applying an operation over a collection of datetimes" is fundamentally what the dt accessor does. By extension, the accessor is unnecessary when working with a single element, just as it is unnecessary when working with a single datetime object.
In pandas, the .dt accessor can only be used on a datetime-typed Series.
The datetime64 dtype provides the guarantee needed to apply year to every element in the Series:
import pandas as pd
data = {'dates': ['2019-03-01', '2019-03-02'], 'event': [0, 1]}
df = pd.DataFrame(data)
df['dates'] = pd.to_datetime(df['dates'])
print(df['dates'].dt.year)
0 2019
1 2019
Name: dates, dtype: int64
Again, however, since a column of object dtype could contain both datetimes and non-datetimes, we may need to access individual elements, like:
import pandas as pd
data = {'dates': ['2019-03-01', 87], 'event': [0, 1]}
df = pd.DataFrame(data)
print(df)
# dates event
# 0 2019-03-01 0
# 1 87 1
# Convert only 1 value to datetime
df.loc[0, 'dates'] = pd.to_datetime(df.loc[0, 'dates'])
print(df.loc[0, 'dates'].year)
# 2019
print(df.loc[1, 'dates'].year)
# AttributeError: 'int' object has no attribute 'year'
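A defensive pattern that follows from this: check the column's dtype before reaching for .dt. A sketch using pandas.api.types.is_datetime64_any_dtype as the check:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

s = pd.to_datetime(pd.Series(['2019-03-01', '2019-03-02']))

# On a datetime64 Series, .dt applies the attribute element-wise...
years = s.dt.year        # 0    2019 / 1    2019

# ...while a single element is a Timestamp with the attribute directly.
first_year = s[0].year   # 2019

# The dtype check distinguishes the two situations up front:
print(is_datetime64_any_dtype(s))  # True
```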

How would I access individual elements of this pandas dataFrame?

How would I access the individual elements of the DataFrame below?
More specifically, how would I retrieve/extract the string "CN112396173" at index 2 of the DataFrame?
Thanks
A more accurate description of your problem would be: "Getting all first words of string column in pandas"
You can use data["PN"].str.split(expand=True)[0]. See the docs.
>>> import pandas as pd
>>> df = pd.DataFrame({"column": ["asdf abc cdf"]})
>>> series = df["column"].str.split(expand=True)[0]
>>> series
0 asdf
Name: 0, dtype: object
>>> series.to_list()
['asdf']
dtype: object is actually normal (in pandas, strings are 'objects').
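An equivalent that skips expand=True is chaining .str.split() with .str[0]; a sketch on made-up data (the column name "PN" is taken from the answer above):

```python
import pandas as pd

# Hypothetical frame resembling the question's patent numbers
df = pd.DataFrame({'PN': ['CN112396173 B2', 'US9999999 A1']})

# .str.split() tokenizes each row; .str[0] picks the first token
first = df['PN'].str.split().str[0]
print(first.tolist())  # ['CN112396173', 'US9999999']
```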

How to get a list of dtypes from a numpy structured array?

How can I get a list of dtypes from a numpy structured array?
Create example structured array:
import numpy as np
arr = np.array([[1.0, 2.0],[3.0, 4.0]])
dt = {'names':['ID', 'Ring'], 'formats':[np.double, np.double]}
arr.dtype = dt
>>> arr
array([[(1., 2.)],
[(3., 4.)]], dtype=[('ID', '<f8'), ('Ring', '<f8')])
On one hand, it's easy to isolate the column names.
>>> arr.dtype.names
('ID', 'Ring')
However, ironically, none of the dtype attributes seem to reveal the individual dtypes.
I discovered that, although dtype doesn't expose dictionary methods like .items(), you can still index it with dtype['<column_name>'].
column_names = list(arr.dtype.names)
dtypes = [str(arr.dtype[n]) for n in column_names]
>>> dtypes
['float64', 'float64']
Or, as @hpaulj hinted, in one step:
>>> [str(v[0]) for v in arr.dtype.fields.values()]
['float64', 'float64']
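For completeness, dtype.descr also yields (name, format) pairs in field order, though it reports byte-order-prefixed format strings rather than friendly names; a sketch:

```python
import numpy as np

# Structured array built directly with a structured dtype
dt = np.dtype([('ID', np.double), ('Ring', np.double)])
arr = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=dt)

# Field-name lookup, as in the answer above
dtypes = [str(arr.dtype[n]) for n in arr.dtype.names]  # ['float64', 'float64']

# descr gives format strings like '<f8' (byte-order prefix depends on platform)
formats = [fmt for _, fmt in arr.dtype.descr]
print(dtypes)
```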

Python Pandas : Read dates from excel in different formats [duplicate]

I have one field in a pandas DataFrame that was imported as string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date?
Example:
import pandas as pd
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the DataFrame method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
dt.datetime.strptime(x,'%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as datetime. By passing infer_datetime_format=True, it will try to detect the format and convert the column automatically (note that this option is deprecated as of pandas 2.0, where fast format inference is the default behaviour).
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however it results in a Python warning of
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chained indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
errors='coerce' is useful
If some rows are not in the correct format or not datetime at all, errors= parameter is very useful, so that you can convert the valid rows and handle the rows that contained invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
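With errors='coerce', rows that fail to parse come back as NaT, which makes the failures easy to locate afterwards; a small sketch:

```python
import pandas as pd

s = pd.Series(['05SEP2014:00:00:00.000', 'not a date'])
parsed = pd.to_datetime(s, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')

# Unparseable rows become NaT rather than raising
bad = s[parsed.isna()]   # the original values that failed to parse
print(parsed[0])         # 2014-09-05 00:00:00
print(bad.tolist())      # ['not a date']
```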
Setting the correct format= is much faster than letting pandas find it out¹
Long story short, passing the correct format= from the beginning, as in chrisb's post, is much faster than letting pandas figure out the format, especially if the format contains a time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking a couple of minutes vs. a few seconds). All valid format options can be found at https://strftime.org/.
¹ Code used to produce the timeit test plot:
import pandas as pd
import perfplot
from random import choices
from datetime import datetime

mdYHMSf = range(1, 13), range(1, 29), range(2000, 2024), range(24), *[range(60)]*2, range(1000)

perfplot.show(
    kernels=[lambda x: pd.to_datetime(x),
             lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
             lambda x: pd.to_datetime(x, infer_datetime_format=True),
             lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
    labels=["pd.to_datetime(df['date'])",
            "pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
            "pd.to_datetime(df['date'], infer_datetime_format=True)",
            "df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
    n_range=[2**k for k in range(20)],
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
                               for m, d, Y, H, M, S, f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
    equality_check=pd.Series.equals,
    xlabel='len(df)'
)
Just as we convert an object dtype to float or int, we can use astype():
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')

How to generate pandas DataFrame column of Categorical from string column?

I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column it seems to get converted right back to Series of str:
train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])
>>> type(pd.Categorical.from_array(train['LocationNormalized']))
<class 'pandas.core.categorical.Categorical'>
# however it got converted back to...
>>> type(train['LocationNFactor'][2])
<type 'str'>
>>> train['LocationNFactor'][2]
'Hampshire'
Guessing this is because Categorical doesn't map to any numpy dtype; so do I have to convert it to some int type, and thus lose the factor labels<->levels association?
What's the most elegant workaround to store the levels<->labels association and retain the ability to convert back? (just store as a dict like here, and manually convert when needed?)
I think Categorical is still not a first-class datatype for DataFrame, unlike R.
(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3 - the latest macports versions of everything).
The only workaround I found for pandas pre-0.15 is as follows:
- the column must be converted to a Categorical for the classifier, but numpy will immediately coerce the levels back to int, losing the factor information
- so store the factor in a global variable outside the dataframe
train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized'])  # default order: alphabetical
train['LocationNFactor'] = train_LocationNFactor.labels  # insert in dataframe
[UPDATE: pandas 0.15+ added decent support for Categorical]
The labels<->levels mapping is stored in the index object.
To convert an integer array to string array: index[integer_array]
To convert a string array to integer array: index.get_indexer(string_array)
Here is an example:
In [56]:
c = pd.Categorical.from_array(['a', 'b', 'c', 'd', 'e'])
idx = c.levels
In [57]:
idx[[1,2,1,2,3]]
Out[57]:
Index([b, c, b, c, d], dtype=object)
In [58]:
idx.get_indexer(["a","c","d","e","a"])
Out[58]:
array([0, 2, 3, 4, 0])
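On modern pandas (0.15+), all of this is built in: astype('category') keeps the labels<->codes association on the Series itself. A sketch of the round trip:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c']).astype('category')

codes = s.cat.codes        # integer codes: 0, 1, 0, 2
cats = s.cat.categories    # Index(['a', 'b', 'c'])

# Round trip: codes back to labels via the categories index
restored = cats[codes]
print(list(restored))      # ['a', 'b', 'a', 'c']
```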