How to change a specific value of a row of a pandas Series? - pandas

I have the following pandas Series:
trade dtype
trade_action category
execution_venue object
from_implied int64
In the last row I would like to change the from_implied name to implied. How can I do this?
Expected output:
trade dtype
trade_action category
execution_venue object
implied int64

Here's what you can do:
ser = pd.Series(["trade","trade_action","execution_venue","from_implied"])
ser2 = ser.replace(to_replace = "from_implied", value = "implied")
With ser.replace you can change the values of a pd.Series as above.

Assuming that the first column is your Series' index, you can use the pd.Series.rename method on your series:
import pandas as pd
# Your series here
series = pd.read_clipboard().set_index("trade")["dtype"]
out = series.rename({"from_implied": "implied"})
out:
trade
trade_action category
execution_venue object
implied int64
Name: dtype, dtype: object
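If you don't want to rely on read_clipboard, here is a minimal sketch that builds an equivalent Series by hand (the index layout and dtype strings are reconstructed from the question, so treat them as illustrative):
import pandas as pd

# Reconstruct the Series from the question: index name "trade", Series name "dtype"
series = pd.Series(
    ["category", "object", "int64"],
    index=pd.Index(["trade_action", "execution_venue", "from_implied"], name="trade"),
    name="dtype",
)

# rename with a dict maps old index labels to new ones
out = series.rename({"from_implied": "implied"})
print(out)
# trade
# trade_action       category
# execution_venue      object
# implied               int64
# Name: dtype, dtype: object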

Related

pandas loc does not preserve data type

Pandas loc indexing does not preserve the datatype of subarrays. Consider the following code:
import pandas as pd
s = pd.Series([1,2,"hi","bye"])
print(s) # dtype: object
print(s.loc[[0]]) # dtype: object
print(type(s.loc[0])) # <class 'int'>
I would like s.loc[[0]] to return a series with type int, rather than obj as it currently does.
You can use .astype(original data type).astype(your preferred data type) on the selection before printing, e.g. in your case:
import pandas as pd
s = pd.Series([1,2,"hi","bye"])
print(s)
print(s.loc[[0]].astype(str).astype(int))
Result:
0 1
dtype: int32
Here is my answer, hope it is useful.
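As a side note, when every selected element is already an integer, the intermediate astype(str) cast is likely unnecessary; a minimal sketch:
import pandas as pd

s = pd.Series([1, 2, "hi", "bye"])

# Casting the object-dtype selection straight to int works here
# because the selected element is already an integer
print(s.loc[[0]].astype(int))
# 0    1
# dtype: int64  (may be int32 on Windows)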

Pandas: Series and DataFrames handling of datetime objects

Pandas behaves in an unusual way when interacting with datetime information when the data type is a Series vs when it is not. Specifically, either .dt is required (if it's a Series) or .dt will throw an error (if it's not a Series). I've spent the better part of an hour tracking the behavior down.
import pandas as pd
data = {'dates':['2019-03-01','2019-03-02'],'event':[0,1]}
df = pd.DataFrame(data)
df['dates'] = pd.to_datetime(df['dates'])
Pandas Series:
df['dates'][0:1].dt.year
>>>
0 2019
Name: dates, dtype: int64
df['dates'][0:1].year
>>>
AttributeError: 'Series' object has no attribute 'year'
Not Pandas Series:
df['dates'][0].year
>>>
2019
df['dates'][0].dt.year
>>>
AttributeError: 'Timestamp' object has no attribute 'dt'
Does anyone know why Pandas behaves this way? Is this a "feature not a bug", i.e. is it actually useful in some setting?
This behaviour is consistent with python. A collection of datetimes is fundamentally different than a single datetime.
We can see this simply with list vs datetime object:
from datetime import datetime
a = datetime.now()
print(a.year)
# 2021
list_of_datetimes = [datetime.now(), datetime.now()]
print(list_of_datetimes.year)
# AttributeError: 'list' object has no attribute 'year'
Naturally a list does not have a year attribute, because in python we cannot guarantee the list contains only datetimes.
We would have to apply some function to each element in the list to access the year, for example:
from datetime import datetime
list_of_datetimes = [datetime.now(), datetime.now()]
print(*map(lambda d: d.year, list_of_datetimes))
# 2021 2021
This concept of "applying an operation over a collection of datetimes" is fundamentally what the dt accessor does. By extension, the accessor is unnecessary when working with a single element, just as it is when working with a single datetime object.
In pandas we can only use the dt accessor with datetime Series.
A number of guarantees need to be made in order to apply .year to all elements in the Series:
import pandas as pd
data = {'dates': ['2019-03-01', '2019-03-02'], 'event': [0, 1]}
df = pd.DataFrame(data)
df['dates'] = pd.to_datetime(df['dates'])
print(df['dates'].dt.year)
0 2019
1 2019
Name: dates, dtype: int64
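To make the analogy concrete, the dt accessor behaves much like mapping the attribute access over each Timestamp in the column; a minimal sketch with the same data as above:
import pandas as pd

df = pd.DataFrame({'dates': pd.to_datetime(['2019-03-01', '2019-03-02']),
                   'event': [0, 1]})

# Mapping .year over each Timestamp mirrors what df['dates'].dt.year does
print(df['dates'].map(lambda ts: ts.year))
# 0    2019
# 1    2019
# Name: dates, dtype: int64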
Again, however, since a column of object type could contain both datetimes and non-datetimes, we may need to access the individual elements. For example:
import pandas as pd
data = {'dates': ['2019-03-01', 87], 'event': [0, 1]}
df = pd.DataFrame(data)
print(df)
# dates event
# 0 2019-03-01 0
# 1 87 1
# Convert only 1 value to datetime
df.loc[0, 'dates'] = pd.to_datetime(df.loc[0, 'dates'])
print(df.loc[0, 'dates'].year)
# 2019
print(df.loc[1, 'dates'].year)
# AttributeError: 'int' object has no attribute 'year'

How to count the number of categorical features with Pandas?

I have a pd.DataFrame which contains different dtypes columns. I would like to have the count of columns of each type. I use Pandas 0.24.2.
I tried:
dataframe.dtypes.value_counts()
It works fine for other dtypes (float64, object, int64), but for some weird reason it doesn't aggregate the 'category' features, and I get a separate count for each categorical column (as if they were counted as different dtype values).
I also tried:
dataframe.dtypes.groupby(by=dataframe.dtypes).agg(['count'])
But that raises a
TypeError: data type not understood.
Reproducible example:
import pandas as pd
df = pd.DataFrame([['A','a',1,10], ['B','b',2,20], ['C','c',3,30]], columns = ['col_1','col_2','col_3','col_4'])
df['col_1'] = df['col_1'].astype('category')
df['col_2'] = df['col_2'].astype('category')
print(df.dtypes.value_counts())
Expected result:
int64 2
category 2
dtype: int64
Actual result:
int64 2
category 1
category 1
dtype: int64
Use DataFrame.get_dtype_counts:
print (df.get_dtype_counts())
category 2
int64 2
dtype: int64
But if you use the latest version of pandas, your original approach is recommended, since get_dtype_counts is:
Deprecated since version 0.25.0.
Use .dtypes.value_counts() instead.
As @jezrael mentioned, get_dtype_counts is deprecated since 0.25.0, and dtypes.value_counts() would still give two separate category entries, so to fix it do:
print(df.dtypes.astype(str).value_counts())
Output:
int64 2
category 2
dtype: int64
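The underlying reason is that each categorical column carries its own CategoricalDtype object (with its own set of categories), so the dtypes don't compare equal until they are reduced to their common string name. The same str cast also makes the groupby variant from the question work; a minimal sketch reusing the df above:
import pandas as pd

df = pd.DataFrame([['A', 'a', 1, 10], ['B', 'b', 2, 20], ['C', 'c', 3, 30]],
                  columns=['col_1', 'col_2', 'col_3', 'col_4'])
df['col_1'] = df['col_1'].astype('category')
df['col_2'] = df['col_2'].astype('category')

# Grouping by the string form of each dtype avoids the
# "data type not understood" error and aggregates the category columns
dtype_names = df.dtypes.astype(str)
print(dtype_names.groupby(dtype_names).count())
# category    2
# int64       2
# dtype: int64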

Why isn't a list of pd.Interval recognized by DataFrame automatically?

intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
pd.DataFrame({'d':intervals}).dtypes
Produces dtype object, not interval:
>>> d object
>>> dtype: object
But at the same time a list of, for example, Timestamps is recognized on the fly:
datetimes = [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]
pd.DataFrame({'d':datetimes}).dtypes
>>> d datetime64[ns]
>>> dtype: object
Is the situation with intervals somewhat like a list of strings, where the default dtype of the column in the DataFrame would also be object because the DataFrame doesn't 'know' whether we want to treat the column as generic objects (for dumping to disk, ...), as strings (for concatenation, ...), or even as elements of a category type? If so, what would the different use cases with intervals be? If not, what is going on here?
This is a bug in pandas: https://github.com/pandas-dev/pandas/issues/23563
For now, the cleanest workaround is to wrap the list with pd.array:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.2'
In [2]: intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
In [3]: pd.DataFrame({'d': pd.array(intervals)}).dtypes
Out[3]:
d interval[float64]
dtype: object
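If the column already exists with object dtype, it can likewise be converted after the fact; a minimal sketch using pandas.arrays.IntervalArray (available since pandas 0.24; the exact dtype repr may vary between versions):
import pandas as pd

intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
df = pd.DataFrame({'d': intervals})        # dtype: object

# Re-wrap the object column as an IntervalArray to get the interval dtype
df['d'] = pd.arrays.IntervalArray(df['d'].tolist())
print(df.dtypes)
# d    interval[float64]
# dtype: object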

How do I get pandas update function to correctly handle numpy.datetime64?

I have a dataframe with a column that may contain None and another dataframe with the same index that has datetime values populated. I am trying to update the first from the second using pandas.update.
import numpy as np
import pandas as pd
df = pd.DataFrame([{'id': 0, 'as_of_date': np.datetime64('2017-05-08')}])
print(df.as_of_date)
df2 = pd.DataFrame([{'id': 0, 'as_of_date': None}])
print(df2.as_of_date)
df2.update(df)
print(df2.as_of_date)
print(df2.apply(lambda x: x['as_of_date'] - np.timedelta64(1, 'D'), axis=1))
This results in
0 2017-05-08
Name: as_of_date, dtype: datetime64[ns]
0 None
Name: as_of_date, dtype: object
0 1494201600000000000
Name: as_of_date, dtype: object
0 -66582 days +10:33:31.122941
dtype: timedelta64[ns]
So basically update converts the datetime to its integer nanosecond representation, but keeps the column dtype as object. Then if I try to do date math on it, I get wacky results because numpy doesn't know how to treat it.
I was hoping df2 would look like df1 after updating. How can I fix this?
Try this:
In [391]: df2 = df2.combine_first(df)
In [392]: df2
Out[392]:
as_of_date id
0 2017-05-08 0
In [396]: df2.dtypes
Out[396]:
as_of_date datetime64[ns]
id int64
dtype: object
A two-step approach:
Fill the None data in df2 using dates from df:
df2 = df2.combine_first(df)
Update all elements in df2 using elements from df:
df2.update(df)
Without the 2nd step, df2 will only take values from df to fill its Nones.
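Putting the two steps together, a minimal end-to-end sketch with the frames from the question (the dtypes should come out as datetime64[ns] and int64, as in the combine_first answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame([{'id': 0, 'as_of_date': np.datetime64('2017-05-08')}])
df2 = pd.DataFrame([{'id': 0, 'as_of_date': None}])

# Step 1: fill the Nones in df2 from df, which also fixes the column dtype
df2 = df2.combine_first(df)

# Step 2: overwrite any remaining values in df2 with those from df
df2.update(df)

print(df2.dtypes)
# as_of_date    datetime64[ns]
# id                     int64
# dtype: object

# Date arithmetic now behaves as expected
print(df2['as_of_date'] - np.timedelta64(1, 'D'))
# 0   2017-05-07
# Name: as_of_date, dtype: datetime64[ns]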