PySpark Create DataFrame With Float TypeError

I have a data set as below:
I am using PySpark to parse the data and then create a DataFrame with the following code:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f
def parseInput(line):
    fields = line.split(',')
    stationID = fields[0]
    entryType = fields[2]
    temperature = fields[3]*0.3
    return Row(stationID, entryType, temperature)
spark = SparkSession.builder.appName("MinTemperatures").getOrCreate()
lines = spark.sparkContext.textFile("data/1800.csv")
temperatures = lines.map(parseInput)
minTemps=temperatures.filter(lambda x:x[1]=='TMIN')
df = spark.createDataFrame(minTemps)
I got the error below:
TypeError: can't multiply sequence by non-int of type 'float'
Obviously, if I remove the 0.3 from temperature = fields[3]*0.3, the DataFrame creation works. How can I return the temperature as a float and apply some basic math to it?

Try temperature = float(fields[3]) * 0.3. The field comes out of line.split(',') as a string, so it has to be converted to a float before it can be multiplied.
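A minimal sketch of the corrected parser under that assumption (using keyword arguments in Row so the DataFrame gets named columns; the column names are illustrative):
from pyspark.sql import Row

def parseInput(line):
    fields = line.split(',')
    stationID = fields[0]
    entryType = fields[2]
    # convert the string field to float before doing arithmetic
    temperature = float(fields[3]) * 0.3
    return Row(stationID=stationID, entryType=entryType, temperature=temperature)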

Alternatively, you can read the file without the multiplication first, cast the column to double, and then do the multiplication.
I assume your CSV file has a header.
The following code does the casting:
data = data.withColumn("COLUMN_NAME", data["COLUMN_NAME"].cast("double"))
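A minimal sketch of that approach, assuming the file really has a header and the column is called temperature (adjust the names to your data):
df = spark.read.option("header", "true").csv("data/1800.csv")
# cast the string column to double, then do the arithmetic
df = df.withColumn("temperature", df["temperature"].cast("double"))
df = df.withColumn("temperature", df["temperature"] * 0.3)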

Related

Pandas series pad function not working with apply in pandas

I am trying to write code to pad the columns of my pandas DataFrame with different characters. I tried using the apply function to fill with '0' via zfill, and it works.
print(df["Date"].apply(lambda x: x.zfill(10)))
But when I try to use the pad function via apply on my dataframe, I get this error:
AttributeError: 'str' object has no attribute 'pad'
The code I am trying is:
print(df["Date"].apply(lambda x: x.pad(10, side="left", fillchar="0")))
Both zfill and pad are part of pandas.Series.str. I am confused why pad does not work while zfill does. How can I achieve this?
Full code:
import pandas as pd
from io import StringIO
StringData = StringIO(
"""Date,Time
パンダ,パンダ
パンダサンDA12-3,パンダーサンDA12-3
パンダサンDA12-3,パンダサンDA12-3
"""
)
df = pd.read_csv(StringData, sep=",")
print(df["Date"].apply(lambda x: x.zfill(10))) -- works
print(df["Date"].apply(lambda x: x.pad(10, side="left", fillchar="0"))) -- doesn't work
I am using pandas 1.5.1.
You should not use apply here: inside apply you are working on plain Python str objects (which have no pad method), so you lose the pandas Series string methods. Use the .str accessor instead:
print(df["Date"].str.zfill(10))
print(df["Date"].str.pad(10, side="left", fillchar="0"))
output:
0 0000000パンダ
1 パンダサンDA12-3
2 パンダサンDA12-3
Name: Date, dtype: object
0 0000000パンダ
1 パンダサンDA12-3
2 パンダサンDA12-3
Name: Date, dtype: object
Multiple columns:
Now, you need to use apply, but this is DataFrame.apply, not Series.apply:
df[['col1', 'col2', 'col3']].apply(lambda s: s.str.pad(10, side="left", fillchar="0"))
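If you want to write the padded values back into the DataFrame, the same call works with assignment as well (col1/col2/col3 are placeholder column names):
cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(lambda s: s.str.pad(10, side="left", fillchar="0"))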

Pandas Rolling Operation on Categorical column

The code I am trying to execute:
for cat_name in df['movement_state'].cat.categories:
    transformed_df[f'{cat_name} Count'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts()[cat])
    transformed_df[f'{cat_name} Ratio'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts(normalize=True)[cat])
For reproduction purposes just assume the following:
import numpy as np
import pandas as pd
d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3
I want to do rolling window operations on my GroupBy object, selecting the column movement_state beforehand. This column is categorical, as shown below.
grouped_df['movement_state'].dtypes
# Output
CategoricalDtype(categories=['moving', 'parking', 'standing'], ordered=False)
If I execute this, I get these error messages:
pandas.core.base.DataError: No numeric types to aggregate
TypeError: cannot handle this type -> category
ValueError: could not convert string to float: 'standing'
Inside this snippet of rolling.py from the pandas source code I read that the data must be converted to float64 before it can be processed by Cython:
def _prep_values(self, values: ArrayLike) -> np.ndarray:
    """Convert input to numpy arrays for Cython routines"""
    if needs_i8_conversion(values.dtype):
        raise NotImplementedError(
            f"ops for {type(self).__name__} for this "
            f"dtype {values.dtype} are not implemented"
        )
    else:
        # GH #12373 : rolling functions error on float32 data
        # make sure the data is coerced to float64
        try:
            if isinstance(values, ExtensionArray):
                values = values.to_numpy(np.float64, na_value=np.nan)
            else:
                values = ensure_float64(values)
        except (ValueError, TypeError) as err:
            raise TypeError(f"cannot handle this type -> {values.dtype}") from err
My question to you
Is it possible to count the values of a categorical column in a pandas DataFrame using the rolling method as I tried to do?
A possible workaround I came up with is to use the codes of the categorical column instead of the string values. But even then, s.value_counts()[cat] would raise a KeyError whenever the window I am looking at does not contain every possible value.
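A hedged sketch of that workaround, building on the reproduction snippet above: rolling can handle the integer category codes, and counting matches directly (instead of indexing value_counts) means a missing category simply yields 0. This is just one way the described idea could look, not an accepted answer:
codes = grouped_df['movement_state'].cat.codes  # integer codes, so rolling can handle them
transformed_df = pd.DataFrame(index=grouped_df.index)

for code, cat_name in enumerate(grouped_df['movement_state'].cat.categories):
    rolled = codes.rolling(rolling_window_size, closed='both')
    # count how many values in the window equal this category's code
    transformed_df[f'{cat_name} Count'] = rolled.apply(lambda s, c=code: (s == c).sum(), raw=True)
    # ratio relative to the window length
    transformed_df[f'{cat_name} Ratio'] = rolled.apply(lambda s, c=code: (s == c).mean(), raw=True)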

Creating PySpark UDFs from python method with numpy array input, to calculate and return a single float value

As input I have a csv file with int values in it.
spark_df = spark.read.option("header", "false").csv("../int_values.csv")
df = spark_df.selectExpr("_c0 as something")
_df = df.withColumn("values", df.something.cast(FloatType())).select("values")
I also have some python functions designed for numpy array inputs, that I need to apply on the Spark DataFrame.
The example one:
import numpy as np

def calc_sum(float_array):
    return np.sum(float_array)
Real function:
def calc_rms(float_array):
    return np.sqrt(np.mean(np.diff(float_array)**2))
For the first example you can use the SQL sum like:
_df.groupBy().sum().collect()
But what I need is a general way to turn these functions into Spark UDFs.
I tried many ways, like:
udf_sum = udf(lambda x : calc_sum(x), FloatType())
_df.rdd.flatMap(udf_sum).collect()
but it always failed with:
TypeError: Invalid argument, not a string or column:
Row(values=1114.0) of type <class 'pyspark.sql.types.Row'>. For column
literals, use 'lit', 'array', 'struct' or 'create_map' function.
Is it possible to transform the data in a way that works with these functions?
DataFrame sample:
In [6]: spark_df.show()
+----+
| _c0|
+----+
|1114|
|1113|
|1066|
|1119|
|1062|
|1089|
|1093|
| 975|
|1099|
|1062|
|1062|
|1162|
|1057|
|1123|
|1141|
|1089|
|1172|
|1096|
|1164|
|1146|
+----+
only showing top 20 rows
Expected output:
A float value returned from the UDF.
For the sum function it should be clear.
What you want is to group by and use collect_list to gather all the integer values into a single array column, then apply your UDF to that column. Also, you need to explicitly return a float from calc_rms:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
import numpy as np

def calc_rms(float_array):
    return float(np.sqrt(np.mean(np.diff(float_array) ** 2)))

calc_rms_udf = F.udf(calc_rms, FloatType())

# collect the float column into a single array column, then apply the UDF
_df.groupby().agg(F.collect_list("values").alias("values")) \
    .select(calc_rms_udf(F.col("values")).alias("rms")) \
    .show()
#+--------+
#| rms|
#+--------+
#|67.16202|
#+--------+
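If you also want calc_sum as a UDF rather than relying on the built-in sum aggregate, the same collect_list pattern should carry over; a small sketch under that assumption (imports as above):
def calc_sum(float_array):
    return float(np.sum(float_array))

calc_sum_udf = F.udf(calc_sum, FloatType())

_df.groupby().agg(F.collect_list("values").alias("values")) \
    .select(calc_sum_udf(F.col("values")).alias("sum")) \
    .show()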

Converting data frame columns in category type in pyspark

I have a data frame df and I want to convert some of its columns into category type. Using pandas I can do it like this:
for col in categorical_collist:
    df[col] = df[col].astype('category')
I want to do the same column conversion in pyspark. How can I do it?
I have tried the code below in pyspark, but it does not give the output I expect.
from pyspark.sql.types import StringType

for col in categorical_collist:
    df = df.withColumn(col, df[col].cast(StringType()))
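For context: Spark has no direct equivalent of pandas' category dtype, so casting to StringType only gives you plain string columns. If the goal is a category-like integer encoding, StringIndexer from pyspark.ml is the usual tool; a hedged sketch, where the "_indexed" suffix is just a placeholder naming choice:
from pyspark.ml.feature import StringIndexer

for col in categorical_collist:
    indexer = StringIndexer(inputCol=col, outputCol=col + "_indexed", handleInvalid="keep")
    df = indexer.fit(df).transform(df)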

Converting DataFrame into sql

I am using the following code to convert my pandas DataFrame to SQL, but I get the error below although the dtype is float64 for this particular column.
I have tried to convert my dtype to str, but this did not work.
import sqlite3
import pandas as pd

# create db file
db = conn = sqlite3.connect('example.db')

# convert my df data to sql
df.to_sql('users', con=db, if_exists='replace')
InterfaceError: Error binding parameter 1214 - probably unsupported type.
However, when I check parameter 1214, i.e. column 1214 in my df, that column has a float64 dtype. I don't understand how to solve this problem.
Double-check your data types, as SQLite supports only a limited set of types (https://www.sqlite.org/datatype3.html). My guess would be to force a plain float dtype for that column (so try dtype='float').
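A hedged sketch of that check, assuming the offending column really is the one at position 1214 and that 'users'/'example.db' match your setup: inspect what the column actually holds and coerce it to a plain float dtype before calling to_sql.
import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')

# see what Python types the suspect column actually holds
col = df.columns[1214]
print(df[col].dtype, df[col].map(type).unique())

# force a plain float dtype, turning anything unparsable into NaN
df[col] = pd.to_numeric(df[col], errors='coerce').astype('float64')

df.to_sql('users', con=conn, if_exists='replace')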