Pandas Rolling Operation on Categorical column

The code I am trying to execute:
for cat_name in df['movement_state'].cat.categories:
    transformed_df[f'{cat_name} Count'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts()[cat])
    transformed_df[f'{cat_name} Ratio'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts(normalize=True)[cat])
For reproduction purposes just assume the following:
import numpy as np
import pandas as pd
d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3
I want to do rolling window operations on my GroupBy object, selecting the column movement_state beforehand. This column is categorical, as shown below.
grouped_df['movement_state'].dtypes
# Output
CategoricalDtype(categories=['moving', 'parking', 'standing'], ordered=False)
If I execute this code, I get these error messages:
pandas.core.base.DataError: No numeric types to aggregate
TypeError: cannot handle this type -> category
ValueError: could not convert string to float: 'standing'
Inside this code snippet of rolling.py from the pandas source code, I read that the data must be converted to float64 before it can be processed by Cython.
def _prep_values(self, values: ArrayLike) -> np.ndarray:
    """Convert input to numpy arrays for Cython routines"""
    if needs_i8_conversion(values.dtype):
        raise NotImplementedError(
            f"ops for {type(self).__name__} for this "
            f"dtype {values.dtype} are not implemented"
        )
    else:
        # GH #12373 : rolling functions error on float32 data
        # make sure the data is coerced to float64
        try:
            if isinstance(values, ExtensionArray):
                values = values.to_numpy(np.float64, na_value=np.nan)
            else:
                values = ensure_float64(values)
        except (ValueError, TypeError) as err:
            raise TypeError(f"cannot handle this type -> {values.dtype}") from err
My question to you
Is it possible to count the values of a categorical column in a pandas DataFrame using the rolling method as I tried to do?
A possible workaround I came up with is to use the codes of the categorical column instead of the string values. But that way, s.value_counts()[cat] would raise a KeyError if the window I am looking at does not contain every possible value.
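One way around the KeyError (a sketch of my own, not from the original post): one-hot encode the categorical column with pd.get_dummies and take rolling sums of the indicator columns. A rolling sum of a 0/1 indicator is exactly the per-category count in each window, and every category keeps its own column even when it is absent from a window, so no KeyError can occur.
import numpy as np
import pandas as pd

d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3

# One 0/1 indicator column per category; absent categories still get a column.
dummies = pd.get_dummies(grouped_df['movement_state'], dtype=float)

# Rolling sums of the indicators are the per-category window counts.
counts = dummies.rolling(rolling_window_size, closed='both').sum()

# Normalizing by the row total reproduces value_counts(normalize=True).
ratios = counts.div(counts.sum(axis=1), axis=0)

print(counts.join(ratios, lsuffix=' Count', rsuffix=' Ratio'))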

Related

Convert pandas to dask code and it errors out

I have pandas code which works perfectly.
import pandas as pd
courses_df = pd.DataFrame(
    [
        ["Jay", "MS"],
        ["Jay", "Music"],
        ["Dorsey", "Music"],
        ["Dorsey", "Piano"],
        ["Mark", "MS"],
    ],
    columns=["Name", "Course"],
)
pandas_df_json = (
    courses_df.groupby(["Name"])
    .apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
    .reset_index(name="courses_json")
)
But when I convert the dataframe to Dask and try the same operation:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
    name="courses_json"
).compute()
And the error I get is:
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [37], in <module>
1 from dask import dataframe as dd
3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
5 name="courses_json"
6 ).compute()
TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from Dask should be the same as from pandas, that is:
Name courses_json
0 Dorsey [{"Course":"Music"},{"Course":"Piano"}]
1 Jay [{"Course":"MS"},{"Course":"Music"}]
2 Mark [{"Course":"MS"}]
How do I achieve this in Dask?
My attempt so far:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(
Out[57]:
Name
Dorsey [{"Course":"Piano"},{"Course":"Music"}]
Jay [{"Course":"MS"},{"Course":"Music"}]
Mark [{"Course":"MS"}]
dtype: object
I want to pass in a meta argument, and I also want the second column to have a meaningful name like courses_json.
For the meta warning, Dask is expecting you to specify the column datatypes for the result. It's optional, but if you do not specify it, Dask may infer faulty datatypes: one partition could, for example, be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records"),
    meta=("Name", "O")
).to_frame()

# rename columns
new_df.columns = ["courses_json"]

# use a numeric int index instead of Name, as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a Dask Series, not a DataFrame, which is why you need to use numpy dtypes here (https://www.w3schools.com/python/numpy/numpy_data_types.asp). A Series consists of an index and values, so you are not able to name the second column directly without converting it back to a DataFrame using the .to_frame() method.
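A variation on the above (my own sketch, relying on Dask taking the output Series' name from the meta tuple): passing meta=("courses_json", "object") names the Series up front, so no rename step is needed after reset_index().
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)

# meta=(name, dtype) for a Series result: the name carries through,
# so reset_index() yields the 'courses_json' column directly.
result = (
    df.groupby(["Name"])
    .apply(
        lambda x: x.drop(columns="Name").to_json(orient="records"),
        meta=("courses_json", "object"),
    )
    .reset_index()
    .compute()
)
print(result)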

How to define a custom round function that works like pandas round and executes in one line

Goal
Only one line to execute.
I am referring to the round function from this post, but I want to use it like df.round(2), which changes the affected columns while keeping the order of the data, without first having to select the float or int columns.
df.applymap(myfunction) raises TypeError: must be real number, not str, which means I have to select the types first.
Try
I looked at the round source code but could not understand how to adapt my function.
First get the columns whose values are float:
cols = df.select_dtypes('float').columns
Finally:
df[cols] = df[cols].agg(round, ndigits=2)
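A small self-contained sketch of those two lines (the mixed-type frame is my own assumed example):
import pandas as pd

# Assumed sample data: a float column, a string column, an int column.
df = pd.DataFrame({'a': [1.2345, 2.3456], 'b': ['x', 'y'], 'c': [3, 4]})

cols = df.select_dtypes('float').columns
df[cols] = df[cols].agg(round, ndigits=2)
print(df)  # only column 'a' is rounded; 'b' and 'c' are untouched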
If you want to make changes in the function then add if/else condition:
from numpy import ceil, floor
def float_round(num, places=2, direction=ceil):
if isinstance(num,float):
return direction(num * (10 ** places)) / float(10 ** places)
else:
return num
out=df.applymap(float_round)
With the error message you mention, it's likely the column is already a string, and needs to be converted to some numeric type.
Let's now assume that the column is numeric. There are a few ways you could implement custom rounding functions that don't require reimplementing the .round() method of a dataframe object.
With the requirements you laid out above, we want a way to round a dataframe that:
fits on one line
doesn't require selecting numeric types
There are two ways we could do this that are functionally equivalent. One is to treat the dataframe as an argument to a function that is safe for numpy arrays.
Another is to use the apply method (explanation here) which applies a function to a row or a column.
import pandas as pd
import numpy as np
from numpy import ceil

# generate a 100x10 dataframe with a null value
data = np.random.random(1000) * 10
data = data.reshape(100, 10)
data[0, 0] = np.nan
df = pd.DataFrame(data)

# changing data type of the second column
df[1] = df[1].astype(int)

# verify dtypes are different
print(df.dtypes)

# taken from other stack post
def float_round(num, places=2, direction=ceil):
    return direction(num * (10 ** places)) / float(10 ** places)

# method 1 - use the dataframe as an argument
result1 = float_round(df)
print(result1.head())

# method 2 - apply
result2 = df.apply(float_round)
print(result2)
Because apply is applied row or column-wise, you can specify logic in your round function to ignore non-numeric columns. For instance:
# taken from other stack post
def float_round(num, places=2, direction=ceil):
    # check the dtype of the column being passed in
    if num.dtype == 'O':
        return num
    return direction(num * (10 ** places)) / float(10 ** places)

# this will work, method 1 will fail
result2 = df.apply(float_round)
print(result2)

Converting DataFrame into sql

I am using the following code to convert my pandas DataFrame into SQL, but I get the error below, although the dtype is float64 for this particular column.
I have tried converting the dtype to str, but this did not work.
import sqlite3
import pandas as pd

# create db file
db = sqlite3.connect('example.db')

# convert my df data to sql
df.to_sql('users', con=db, if_exists='replace')
InterfaceError: Error binding parameter 1214 - probably unsupported type.
However, when I check parameter 1214, i.e. column 1214 in my df, that column has a float64 dtype, so I don't understand how to solve this problem.
Double-check your data types, as SQLite supports only a limited number of data types --> https://www.sqlite.org/datatype3.html. My guess would be to use a float dtype (so try dtype='float').
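A minimal sketch of that suggestion (the column name and data are hypothetical, since the failing column isn't shown): coerce the suspect column through pd.to_numeric so every value is a plain float64 that SQLite can bind.
import sqlite3
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: a column that reports float64
# but may still hide values SQLite cannot bind.
df = pd.DataFrame({'col_1214': [1.5, 2.5, np.nan]})

# Coerce anything non-numeric to NaN and pin the dtype explicitly.
df['col_1214'] = pd.to_numeric(df['col_1214'], errors='coerce').astype('float64')

db = sqlite3.connect('example.db')
df.to_sql('users', con=db, if_exists='replace')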

Error performing np.std for array

This is my code. I'm trying to calculate the standard deviation of an imported list, which is shown below.
import csv
import statistics

b = []
#time = []
with open('nt.txt') as csvfile:
    data = csv.reader(csvfile, delimiter='\t')
    index = 0
    for line in data:
        b.append(line[1])
        #out = line[0]
        #new = out.split(" ")
        #b.append(new[0])
        #else: break
x = statistics.stdev(b)
print(x)
with b = ['-0,002549', '-0,002040', '-0,001530'] as my output,
I get:
raise TypeError(msg.format(type(x).__name__)) from None
TypeError: can't convert type 'str' to numerator/denominator
results = np.array([[x], [b]]).astype(np.float32)
You have to set the type of the numpy array, not the list.
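Note that these strings also use a comma as the decimal separator, so a plain float conversion would still fail; a short sketch of the full fix (my own addition, not part of the answer above):
import numpy as np

b = ['-0,002549', '-0,002040', '-0,001530']

# Normalize the decimal separator before converting to a float array.
values = np.array([v.replace(',', '.') for v in b], dtype=np.float64)

# ddof=1 matches statistics.stdev (sample standard deviation).
print(np.std(values, ddof=1))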

Performing math operations when plotting Pandas Dataframe columns

I'd like to plot products, ratios, etc. of columns in a Pandas DataFrame without first creating a new column containing that product, ratio, etc. E.g.,
[df['A']/df['A']].plot()
doesn't work. For the following code:
x = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(x, columns=['A', 'B', 'C'])
[df['A']/df['B']].plot()
I get the following error message: "AttributeError: 'list' object has no attribute 'plot' "
The division df['A']/df['B'] actually returns a pandas Series; it is the square brackets around it in this line:
[df['A']/df['B']].plot()
that create a Python list object, and a list object has no plot attribute.
If you want to plot such a derived series without first adding it to the dataframe, you can try this:
import pandas as pd
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(x, columns=['A', 'B', 'C'])
df['A'].div(df['B']).plot()
which returns a <matplotlib.axes._subplots.AxesSubplot> object
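Equivalently (a small sketch, not from the original answer): plain parentheses instead of square brackets keep the result a pandas Series, which has its own .plot() method.
import numpy as np
import pandas as pd

x = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(x, columns=['A', 'B', 'C'])
(df['A'] / df['B']).plot()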