How to count the number of categorical features with Pandas?

I have a pd.DataFrame whose columns have different dtypes. I would like to count the number of columns of each dtype. I am using Pandas 0.24.2.
I tried:
dataframe.dtypes.value_counts()
It works fine for the other dtypes (float64, object, int64), but for a weird reason it doesn't aggregate the 'category' features, and I get a separate count for each categorical column (as if each were a distinct dtype).
I also tried:
dataframe.dtypes.groupby(by=dataframe.dtypes).agg(['count'])
But that raises a
TypeError: data type not understood.
Reproducible example:
import pandas as pd
df = pd.DataFrame([['A','a',1,10], ['B','b',2,20], ['C','c',3,30]], columns = ['col_1','col_2','col_3','col_4'])
df['col_1'] = df['col_1'].astype('category')
df['col_2'] = df['col_2'].astype('category')
print(df.dtypes.value_counts())
Expected result:
int64 2
category 2
dtype: int64
Actual result:
int64 2
category 1
category 1
dtype: int64

Use DataFrame.get_dtype_counts:
print(df.get_dtype_counts())
category 2
int64 2
dtype: int64
But if you use the latest version of pandas, your solution is the recommended one, because get_dtype_counts is:
Deprecated since version 0.25.0.
Use .dtypes.value_counts() instead.

As @jezrael mentioned, get_dtype_counts is deprecated since 0.25.0, and dtypes.value_counts() would still report the two category columns separately, so to fix it do:
print(df.dtypes.astype(str).value_counts())
Output:
int64 2
category 2
dtype: int64

Related

How to change a specific value of a row of a pandas Series?

I have the following pandas Series:
trade dtype
trade_action category
execution_venue object
from_implied int64
In the last row I would like to change the from_implied label to implied. How can I do this?
Expected output:
trade dtype
trade_action category
execution_venue object
implied int64
Here's what you can do:
import pandas as pd

ser = pd.Series(["trade", "trade_action", "execution_venue", "from_implied"])
ser2 = ser.replace(to_replace="from_implied", value="implied")
With ser.replace you can change the values of a pd.Series as above.
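For reference, printing the result confirms the replacement (using the ser2 built above):
print(ser2.tolist())
# ['trade', 'trade_action', 'execution_venue', 'implied']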
Assuming that the first column is your Series' index, you can use the pd.Series.rename method on your series:
import pandas as pd
# Your series here
series = pd.read_clipboard().set_index("trade")["dtype"]
out = series.rename({"from_implied": "implied"})
out:
trade
trade_action category
execution_venue object
implied int64
Name: dtype, dtype: object

Why does astype to int64 not work in Pandas?

I am trying to convert a column's type to int64:
new_df.astype({'NUM': 'int64'})
After df.info() I still see this:
0 NUM 10 non-null object
Why?
The type casting is not done in-place; DataFrame.astype returns a new DataFrame with the converted types, so you have to reassign the result to new_df:
new_df = new_df.astype({'NUM': 'int64'})
print(new_df.info())
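A minimal, self-contained sketch of the same point (the 'NUM' values here are made up for illustration):
import pandas as pd

# Hypothetical data: 'NUM' arrives as strings, so its dtype is object
new_df = pd.DataFrame({'NUM': ['1', '2', '3']})
print(new_df['NUM'].dtype)   # object

# astype returns a copy; without the reassignment new_df keeps the old dtype
new_df = new_df.astype({'NUM': 'int64'})
print(new_df['NUM'].dtype)   # int64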

Pandas set column value to 1 if other column value is NaN

I've looked everywhere and tried .loc, .apply, and lambda, but I still cannot figure this out.
I have the UCI congressional vote dataset in a pandas dataframe and some votes are missing for votes 1 to 16 for each Democrat or Republican Congressperson.
So I inserted 16 abs columns, one for each of the vote columns.
I want each abs column to be 1 if the corresponding vote column is NaN.
None of those above methods I read on this site worked for me.
So I have the snippet below, which also does not work, but it might give a hint as to what I'm attempting with basic iterative Python syntax.
for i in range(16):
    for j in range(len(cvotes['v1'])):
        if cvotes['v{}'.format(i+1)][j] == np.nan:
            cvotes['abs{}'.format(i+1)][j] = 1
        else:
            cvotes['abs{}'.format(i+1)][j] = 0
Any suggestions?
The above currently gives me 1 for abs when the vote value is NaN or 1.
edit:
I saw the given answer, so I tried this with just one column:
cols = ['v1']
for col in cols:
    cvotes = cvotes.join(cvotes[col].add_prefix('abs').isna().astype(int))
but it's giving me an error:
ValueError: columns overlap but no suffix specified: Index(['v1'], dtype='object')
My dtypes are:
party object
v1 float64
v2 float64
v3 float64
v4 float64
v5 float64
v6 float64
v7 float64
v8 float64
v9 float64
v10 float64
v11 float64
v12 float64
v13 float64
v14 float64
v15 float64
v16 float64
abs1 int64
abs2 int64
abs3 int64
abs4 int64
abs5 int64
abs6 int64
abs7 int64
abs8 int64
abs9 int64
abs10 int64
abs11 int64
abs12 int64
abs13 int64
abs14 int64
abs15 int64
abs16 int64
dtype: object
Let us just do join with add_prefix:
col = [c1, c2, ...]                                     # the vote columns to check
s = pd.DataFrame(df[col].values.tolist(), index=df.index)
s.columns = s.columns + 1                               # renumber the columns 1..n
df = df.join(s.add_prefix('abs').isna().astype(int))    # abs1..absn: 1 where NaN, else 0
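Applied to the column names from the question, a sketch of the same idea (assuming cvotes is the DataFrame described above), replacing the nested loops with a vectorized check per column:
# For each vote column v1..v16, set the matching abs column to 1 where the
# vote is NaN and 0 otherwise; isna() replaces the broken == np.nan test
for i in range(1, 17):
    cvotes['abs{}'.format(i)] = cvotes['v{}'.format(i)].isna().astype(int)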

How do I get pandas update function to correctly handle numpy.datetime64?

I have a dataframe with a column that may contain None and another dataframe with the same index that has datetime values populated. I am trying to update the first from the second using pandas.update.
import numpy as np
import pandas as pd
df = pd.DataFrame([{'id': 0, 'as_of_date': np.datetime64('2017-05-08')}])
print(df.as_of_date)
df2 = pd.DataFrame([{'id': 0, 'as_of_date': None}])
print(df2.as_of_date)
df2.update(df)
print(df2.as_of_date)
print(df2.apply(lambda x: x['as_of_date'] - np.timedelta64(1, 'D'), axis=1))
This results in
0 2017-05-08
Name: as_of_date, dtype: datetime64[ns]
0 None
Name: as_of_date, dtype: object
0 1494201600000000000
Name: as_of_date, dtype: object
0 -66582 days +10:33:31.122941
dtype: timedelta64[ns]
So basically update converts the datetime to an integer (nanoseconds since the epoch) but keeps the dtype as object. Then if I try to do date math on it, I get wacky results because numpy doesn't know how to treat it.
I was hoping df2 would look like df1 after updating. How can I fix this?
Try this:
In [391]: df2 = df2.combine_first(df)
In [392]: df2
Out[392]:
as_of_date id
0 2017-05-08 0
In [396]: df2.dtypes
Out[396]:
as_of_date datetime64[ns]
id int64
dtype: object
A two-step approach:
Fill the None data in df2 using dates from df:
df2 = df2.combine_first(df)
Then update all elements in df2 using the elements from df:
df2.update(df)
Without the 2nd step, df2 would only take values from df to fill its Nones.
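Putting both steps together on the frames from the question (a sketch, assuming the imports and df/df2 defined there; date arithmetic should now behave because as_of_date is datetime64 again):
df2 = df2.combine_first(df)   # step 1: fill df2's None values from df
df2.update(df)                # step 2: overwrite the remaining values with those from df
print(df2.dtypes)
print(df2['as_of_date'] - np.timedelta64(1, 'D'))   # 2017-05-07, as expected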

Grouped function between 2 columns in a pandas.DataFrame?

I have a dataframe that has multiple numerical data columns, and a 'group' column. I want to get the output of various functions over two of the columns, for each group.
Example data and function:
import numpy as np
import pandas

df = pandas.DataFrame({"Dummy": [1, 2] * 6, "X": [1, 3, 7] * 4,
                       "Y": [2, 3, 4] * 4, "group": ["A", "B"] * 6})

def RMSE(X):
    return np.sqrt(np.sum((X.iloc[:, 0] - X.iloc[:, 1]) ** 2))
I want to do something like
group_correlations = df[["X", "Y"]].groupby('group').apply(RMSE)
But if I do that, the 'group' column isn't in the dataframe. If I do it the other way around, like this:
group_correlations = df.groupby('group')[["X", "Y"]].apply(RMSE)
Then the column selection doesn't work:
df.groupby('group')[['X', 'Y']].head(1)
Dummy X Y group
group
A 0 1 1 2 A
B 1 2 3 3 B
the Dummy column is still included, so the function will calculate RMSE on the wrong data.
Is there any way to do what I'm trying to do? I know I could do a for loop over the different groups, and subselect the columns manually, but I'd prefer to do it the pandas way, if there is one.
This looks like a bug (or grabbing multiple columns in a groupby is not implemented?); a workaround is to pass in the groupby column directly:
In [11]: df[['X', 'Y']].groupby(df['group']).apply(RMSE)
Out[11]:
group
A 4.472136
B 4.472136
dtype: float64
To see it's the same:
In [12]: df.groupby('group')[['X', 'Y']].apply(RMSE) # wrong
Out[12]:
group
A 8.944272
B 7.348469
dtype: float64
In [13]: df.iloc[:, 1:].groupby('group')[['X', 'Y']].apply(RMSE) # correct: ignore dummy col
Out[13]:
group
A 4.472136
B 4.472136
dtype: float64
More robust implementation:
To avoid this completely, you could change RMSE to select the columns by name:
In [21]: def RMSE2(X, left_col, right_col):
    ...:     return(np.sqrt(np.sum((X[left_col] - X[right_col])**2)))
In [22]: df.groupby('group').apply(RMSE2, 'X', 'Y') # equivalent to passing lambda x: RMSE2(x, 'X', 'Y'))
Out[22]:
group
A 4.472136
B 4.472136
dtype: float64
Thanks to @naught101 for pointing out the sweet apply syntax that avoids the lambda.
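For what it's worth, on recent pandas versions the column selection after groupby appears to be honoured, so the approach from the question should work there directly (a sketch, not verified against every version):
group_rmse = df.groupby('group')[['X', 'Y']].apply(RMSE)
print(group_rmse)
# group
# A    4.472136
# B    4.472136
# dtype: float64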