'pandas' has no attribute 'to_float' - pandas

I'm testing my data using the SVM classifier. My dataset is in text form and I'm trying to transform it into floats.
I have data that may look like this:
dataset
# Transform as float
df.columns = df('columns').str.rstrip('%').astype('float') / 100.0
TypeError Traceback (most recent call last)
<ipython-input-66-74921537411d> in <module>
1 # Transform as float
----> 2 df.columns = df('columns').str.rstrip('%').astype('float') / 100.0
3
TypeError: 'DataFrame' object is not callable

Basically, arbitrary text cannot be converted to float. In your dataset it seems that all the columns have text values, and I'm not sure whether the values become numbers after rstrip('%') (the values are too long, so they are truncated in the image).
If a column's values do become numbers after rstrip('%'), then you can convert it. In addition, you are using (), not [], to index the dataframe. Because you wrote df(...), it looks like a function call, which is what the TypeError: 'DataFrame' object is not callable is complaining about. If a column's values are numbers, you can do what you want as follows:
df['columns'] = df['columns'].str.rstrip('%').astype('float') / 100.0
Here is a full code sample:
import pandas as pd

df = pd.DataFrame({
    'column_name': ['111%', '222%'],
})
# df looks like:
#   column_name
# 0        111%
# 1        222%
df['column_name'] = df['column_name'].str.rstrip('%').astype('float') / 100.0
print(df)
#    column_name
# 0         1.11
# 1         2.22
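As an aside (my addition, not part of the original answer): if some cells might still not be clean numeric strings after stripping the '%', pd.to_numeric with errors='coerce' turns anything unparseable into NaN instead of raising:

import pandas as pd

df = pd.DataFrame({'column_name': ['111%', '222%', 'n/a']})

# Strip the trailing '%' and coerce anything non-numeric to NaN
df['column_name'] = pd.to_numeric(
    df['column_name'].str.rstrip('%'), errors='coerce'
) / 100.0

print(df)
#    column_name
# 0         1.11
# 1         2.22
# 2          NaN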

Related

Convert pandas to dask code and it errors out

I have pandas code which works perfectly.
import pandas as pd

courses_df = pd.DataFrame(
    [
        ["Jay", "MS"],
        ["Jay", "Music"],
        ["Dorsey", "Music"],
        ["Dorsey", "Piano"],
        ["Mark", "MS"],
    ],
    columns=["Name", "Course"],
)

pandas_df_json = (
    courses_df.groupby(["Name"])
    .apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
    .reset_index(name="courses_json")
)
But when I convert the dataframe to Dask and try the same operation.
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
    name="courses_json"
).compute()
And the error I get is
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [37], in <module>
1 from dask import dataframe as dd
3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
5 name="courses_json"
6 ).compute()
TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from dask and pandas should be the same, that is:
Name courses_json
0 Dorsey [{"Course":"Music"},{"Course":"Piano"}]
1 Jay [{"Course":"MS"},{"Course":"Music"}]
2 Mark [{"Course":"MS"}]
How do I achieve this in dask?
My try so far
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(
Out[57]:
Name
Dorsey [{"Course":"Piano"},{"Course":"Music"}]
Jay [{"Course":"MS"},{"Course":"Music"}]
Mark [{"Course":"MS"}]
dtype: object
I want to pass in a meta argument and also want the second column to have a meaningful name like courses_json.
For the meta warning, Dask is expecting you to specify the column datatypes for the result. It's optional, but if you do not specify this it's entirely possible that Dask may infer faulty datatypes. One partition could for example be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records"),
    meta=("Name", "O")
).to_frame()
# rename columns
new_df.columns = ["courses_json"]
# use a numeric int index instead of Name, as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a dask Series, not a DataFrame, which is why the meta is given as a numpy-style (name, dtype) pair here (https://www.w3schools.com/python/numpy/numpy_data_types.asp). A Series consists of an index and values, so you're not directly able to name the second column without converting it back to a dataframe using the .to_frame() method.
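As a small variation (a sketch on my part, assuming dask's Series.to_frame forwards pandas' name argument, which its docs indicate it does), you can name the column in the same step that converts the Series back to a dataframe:

from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)

new_df = (
    df.groupby(["Name"])
    .apply(
        lambda x: x.drop(columns="Name").to_json(orient="records"),
        meta=("Name", "object"),  # meta for the Series result: (name, dtype)
    )
    .to_frame(name="courses_json")  # Series -> DataFrame, naming the column in one step
    .reset_index()
)
print(new_df.compute())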

Pandas Rolling Operation on Categorical column

The code I am trying to execute:
for cat_name in df['movement_state'].cat.categories:
    transformed_df[f'{cat_name} Count'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts()[cat])
    transformed_df[f'{cat_name} Ratio'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts(normalize=True)[cat])
For reproduction purposes just assume the following:
import numpy as np
import pandas as pd
d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3
I want to do rolling window operations on my GroupBy Object. I am selecting the column movement_state beforehand. This column is categorical as shown below.
grouped_df['movement_state'].dtypes
# Output
CategoricalDtype(categories=['moving', 'parking', 'standing'], ordered=False)
If I execute, I get these error messages:
pandas.core.base.DataError: No numeric types to aggregate
TypeError: cannot handle this type -> category
ValueError: could not convert string to float: 'standing'
Inside this code snippet of rolling.py from the pandas source code I read that the data must be converted to float64 before it can be processed by cython.
def _prep_values(self, values: ArrayLike) -> np.ndarray:
    """Convert input to numpy arrays for Cython routines"""
    if needs_i8_conversion(values.dtype):
        raise NotImplementedError(
            f"ops for {type(self).__name__} for this "
            f"dtype {values.dtype} are not implemented"
        )
    else:
        # GH #12373 : rolling functions error on float32 data
        # make sure the data is coerced to float64
        try:
            if isinstance(values, ExtensionArray):
                values = values.to_numpy(np.float64, na_value=np.nan)
            else:
                values = ensure_float64(values)
        except (ValueError, TypeError) as err:
            raise TypeError(f"cannot handle this type -> {values.dtype}") from err
My question to you
Is it possible to count the values of a categorical column in a pandas DataFrame using the rolling method as I tried to do?
A possible workaround I came up with is to just use the codes of the categorical column instead of the string values. But this way, s.value_counts()[cat] would raise a KeyError if the window I am looking at does not contain every possible value.
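For what it's worth, here is a minimal sketch of that codes-based workaround (my own illustration, not from the original post): rolling over the integer codes keeps the float-only rolling machinery happy, and counting matches with a boolean comparison avoids the KeyError when a category is absent from a window.

import numpy as np
import pandas as pd

d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3

transformed_df = pd.DataFrame(index=grouped_df.index)
codes = grouped_df['movement_state'].cat.codes  # integer codes instead of the category strings

for code, cat_name in enumerate(grouped_df['movement_state'].cat.categories):
    transformed_df[f'{cat_name} Count'] = codes.rolling(rolling_window_size, closed='both').apply(
        lambda s, c=code: (s == c).sum(), raw=True
    )
    transformed_df[f'{cat_name} Ratio'] = codes.rolling(rolling_window_size, closed='both').apply(
        lambda s, c=code: (s == c).mean(), raw=True
    )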

How to define a round function like pandas round that executes in one line of code

Goal
Only one line to execute.
I took the round function from this post. But I want to use it like df.round(2), which changes the affected columns while keeping the order of the data, without having to select the float or int columns first.
df.applymap(myfunction) raises TypeError: must be real number, not str, which means I have to select the column types first.
Try
I referred to the round source code but could not understand how to change my function.
First get the columns whose values are float:
cols = df.select_dtypes('float').columns
Finally:
df[cols] = df[cols].agg(round, ndigits=2)
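If the one-line requirement is strict, those two steps can also be folded into a single statement (just a sketch; it still selects the float columns, only inline):

import pandas as pd

df = pd.DataFrame({'a': [1.2345, 2.3456], 'b': ['x', 'y'], 'c': [1, 2]})

# round only the float columns in place, leaving string and int columns untouched
df[df.select_dtypes('float').columns] = df.select_dtypes('float').round(2)

print(df)
#       a  b  c
# 0  1.23  x  1
# 1  2.35  y  2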
If you want to make changes in the function, then add an if/else condition:
from numpy import ceil, floor

def float_round(num, places=2, direction=ceil):
    if isinstance(num, float):
        return direction(num * (10 ** places)) / float(10 ** places)
    else:
        return num

out = df.applymap(float_round)
With the error message you mention, it's likely the column is already a string, and needs to be converted to some numeric type.
Now let's assume that the column is numeric. There are a few ways you could implement a custom rounding function that don't require reimplementing the .round() method of a dataframe object.
With the requirements you laid out above, we want a way to round a dataframe that:
fits on one line
doesn't require selecting numeric type
There are two ways we could do this that are functionally equivalent. One is to treat the dataframe as an argument to a function that is safe for numpy arrays.
Another is to use the apply method (explanation here) which applies a function to a row or a column.
import pandas as pd
import numpy as np
from numpy import ceil
# generate a 100x10 dataframe with a null value
data = np.random.random(1000) * 10
data = data.reshape(100,10)
data[0, 0] = np.nan
df = pd.DataFrame(data)
# changing data type of the second column
df[1] = df[1].astype(int)
# verify dtypes are different
print(df.dtypes)
# taken from other stack post
def float_round(num, places=2, direction=ceil):
    return direction(num * (10 ** places)) / float(10 ** places)
# method 1 - use the dataframe as an argument
result1 = float_round(df)
print(result1.head())
# method 2 - apply
result2 = df.apply(float_round)
print(result2)
Because apply is applied row or column-wise, you can specify logic in your round function to ignore non-numeric columns. For instance:
# taken from other stack post
def float_round(num, places=2, direction=ceil):
    # check type of a specific column
    if num.dtype == 'O':
        return num
    return direction(num * (10 ** places)) / float(10 ** places)
# this will work, method 1 will fail
result2 = df.apply(float_round)
print(result2)

Pandas dataframe - multiplying DF's elementwise on same dates - something wrong?

I've been banging my head over this; I just cannot seem to get it right and I don't understand what the problem is. So I tried to do the following:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import quandl
btc_usd_price_kraken = quandl.get('BCHARTS/KRAKENUSD', returns="pandas")
btc_usd_price_kraken.replace(0, np.nan, inplace=True)
plt.plot(btc_usd_price_kraken.index, btc_usd_price_kraken['Weighted Price'])
plt.grid(True)
plt.title("btc_usd_price_kraken")
plt.show()
eur_usd_price = quandl.get('BUNDESBANK/BBEX3_D_USD_EUR_BB_AC_000', returns="pandas")
eur_dkk_price = quandl.get('ECB/EURDKK', returns="pandas")
usd_dkk_price = eur_dkk_price / eur_usd_price
btc_dkk = btc_usd_price_kraken['Weighted Price'] * usd_dkk_price
plt.plot(btc_dkk.index, btc_dkk) # WHY IS THIS [4785 rows x 1340 columns] ???
plt.grid(True)
plt.title("Historic value of 1 BTC converted to DKK")
plt.show()
As you can see in the comment, I don't understand why I get a result (which I'm trying to plot) that has size [4785 rows x 1340 columns]?
Anyway, the code results in a lot of error messages, for example:
> Traceback (most recent call last): File
> "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_qt5agg.py",
> line 197, in __draw_idle_agg
> FigureCanvasAgg.draw(self) File "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_agg.py",
...
> return _from_ordinalf(x, tz) File "/usr/lib/python3.6/site-packages/matplotlib/dates.py", line 254, in
> _from_ordinalf
> dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC) ValueError: ordinal must be >= 1
I read some posts and I know that when multiplying, pandas DataFrames automatically do elementwise multiplication only on data pairs where the date is the same (so if one DF has a timeseries for e.g. 1999-2017 and the other only has e.g. 2012-2015, then only the common dates within 2012-2015 are multiplied, i.e. the intersection of the two data sets). So this question is about understanding the error message(s) (and the solution); the whole problem comes down to calculating the btc_dkk variable and plotting it (which is the price of Bitcoin in the currency DKK).
This should work:
usd_dkk_price.multiply(btc_usd_price_kraken['Weighted Price'], axis='index').dropna()
You are multiplying on columns, not the index (this happens because you are multiplying a dataframe by a series; if you had selected the column in usd_dkk_price, this would not have happened). Then afterwards just drop the rows with NaN.
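For completeness, a minimal sketch of that second option, continuing from the question's variables (and using positional selection so no column name has to be assumed): select the single price column as a Series, so the multiplication aligns on the shared date index, then drop the dates that only one series has.

# usd_dkk_price is a one-column DataFrame; take that column as a Series
usd_dkk_series = usd_dkk_price.iloc[:, 0]

btc_dkk = (btc_usd_price_kraken['Weighted Price'] * usd_dkk_series).dropna()

plt.plot(btc_dkk.index, btc_dkk)
plt.grid(True)
plt.title("Historic value of 1 BTC converted to DKK")
plt.show()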

how to add the following feature to a tfidf matrix?

Hello, I have a list called list_cluster that looks as follows:
list_cluster=["hello,this","this is a test","the car is red",...]
I am using TfidfVectorizer to produce a model as follows:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

with open('vectorizerTFIDF.pickle', 'rb') as infile:
    tdf = pickle.load(infile)

tfidf2 = tdf.transform(list_cluster)
Then I would like to add new features to this matrix, called tfidf2. I have a list as follows:
dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]
This list has the same length as list_cluster and represents the date: it has 12 positions, and the position of the 1 is the corresponding month of the year; for instance '010000000000' represents February.
In order to use it as a feature, first I tried:
import numpy as np
dates=np.array(listMonth)
dates=np.transpose(dates)
to get a numpy array and then to transpose it in order to concatenate it with the first matrix tfidf2
print("shape tfidf2: "+str(tfidf2.shape),"shape dates: "+str(dates.shape))
in order to concatenate my vector and matrix I tried:
tfidf2=np.hstack((tfidf2,dates[:,None]))
However this is the output:
shape tfidf2: (11159, 1927) shape dates: (11159,)
Traceback (most recent call last):
File "Main.py", line 230, in <module>
tfidf2=np.hstack((tfidf2,dates[:,None]))
File "/usr/local/lib/python3.5/dist-packages/numpy/core/shape_base.py", line 278, in hstack
return _nx.concatenate(arrs, 0)
ValueError: all the input arrays must have same number of dimensions
The shapes seem right, but I am not sure what is failing. I would appreciate support in concatenating this feature to my tfidf2 matrix. Thanks in advance for the attention.
You need to convert all strings to numerics for sklearn. One way to do this is to use the LabelBinarizer class in the preprocessing module of sklearn. This creates a new binary column for each unique value in your original column.
If dates has the same number of rows as tfidf2, then I think this will work.
from sklearn.preprocessing import LabelBinarizer
import numpy as np

# create tfidf2
tfidf2 = tdf.transform(list_cluster)
# create dates
dates = ['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010', ...]
# binarize dates
lb = LabelBinarizer()
b_dates = lb.fit_transform(dates)
new_tfidf = np.concatenate((tfidf2, b_dates), axis=1)
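One caveat worth adding (my note, not part of the original answer): TfidfVectorizer.transform returns a scipy sparse matrix, and np.concatenate cannot combine a sparse matrix with a dense array. If tfidf2 is sparse, a sketch using scipy.sparse.hstack appends the binarized dates without densifying the tf-idf features:

from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
b_dates = lb.fit_transform(dates)  # dense (n_samples, n_unique_date_strings) 0/1 indicator array

# stack column-wise; the dense block is converted to sparse so the result stays sparse
new_tfidf = hstack([tfidf2, csr_matrix(b_dates)]).tocsr()
print(new_tfidf.shape)  # (n_samples, n_tfidf_features + n_unique_date_strings)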