MLeap support for Spark ML Imputer - serialization

Reading through the MLeap documentation I can see that the Spark ML Imputer is on the list of supported transformers.
However, when I try to serialize the pipeline in pyspark I get java.util.NoSuchElementException: key not found: org.apache.spark.ml.feature.ImputerModel.
Does this mean that the Imputer is not supported?
I have found a ticket in the MLeap repo about this problem. Does it mean that only the MLeap version of the Spark Imputer (the one from mleap-spark-extension) is supported? If so, how can I use it from pyspark? (In that case the documentation is very misleading and should mention this somewhere.)
My code that fails to serialize the pipeline (pyspark 3.0.3, mleap 0.19.0):
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer
from pyspark.sql import SparkSession
from mleap.pyspark.spark_support import SimpleSparkSerializer
input = [
    {"a": 0, "b": None},
    {"a": None, "b": 0},
    {"a": 10, "b": None},
    {"a": None, "b": 10},
]

spark = SparkSession.builder \
    .config('spark.jars.packages', 'ml.combust.mleap:mleap-spark_2.12:0.19.0') \
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all") \
    .getOrCreate()

df = spark.sparkContext.parallelize(input).toDF()

pip = Pipeline(stages=[
    Imputer(strategy="mean", inputCols=["a", "b"], outputCols=["a", "b"])
])
fitted_pip = pip.fit(df)
fitted_pip.serializeToBundle("jar:file:/tmp/test-pip.zip", fitted_pip.transform(df))

Related

Save a pandas dataframe containing numpy arrays

I have a dataframe with a column full of numpy arrays.
     A         B                    C
0  1.0  0.000000    [[0. 1.],[0. 1.]]
1  2.0  0.000000  [[85. 1.],[52. 0.]]
2  3.0  0.000000    [[5. 1.],[0. 0.]]
3  1.0  3.333333   [[0. 1.],[41. 0.]]
4  2.0  3.333333  [[85. 1.],[0. 21.]]
The problem is that when I save it as a CSV file and then load it in another Python file, the numpy column is read back as text.
I tried to transform the column with np.fromstring() or np.loadtxt() but it doesn't work.
Example of an array after pd.read_csv():
"[[ 85. 1.]\n [ 52. 0. ]]"
Thanks
You can try .to_json():
import numpy as np
import pandas as pd

output = pd.DataFrame([
    {'a': 1, 'b': np.arange(4)},
    {'a': 2, 'b': np.arange(5)}
]).to_json()
But you will only get plain lists back when reloading with
df = pd.read_json(output)
Turn them back into numpy arrays with:
df['b'] = [np.array(v) for v in df['b']]
The code below should work. I used another question to solve it; there's a bit more explanation in there: Convert a string with brackets to numpy array
import pandas as pd
import numpy as np
from ast import literal_eval

# Recreating the DataFrame
data = np.array([0, 1, 0, 1, 85, 1, 52, 0, 5, 1, 0, 0, 0, 1, 41, 0, 85, 1, 0, 21], dtype='float')
data = data.reshape((5, 2, 2))
write_df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 1.0, 2.0],
                         'B': [0, 0, 0, 3 + 1/3, 3 + 1/3],
                         'C': data.tolist()})

# Saving DataFrame to CSV
fpath = 'D:\\Data\\test.csv'
write_df.to_csv(fpath)

# Reading DataFrame from CSV
read_df = pd.read_csv(fpath)

# literal_eval converts the string to a nested list;
# np.array can convert this nested list directly into an array
def makeArray(rawdata):
    string = literal_eval(rawdata)
    return np.array(string)

# Applying the function row-wise; there could be a more efficient way
read_df['C'] = read_df['C'].apply(makeArray)
Here is an ugly solution.
import pandas as pd
import numpy as np

### Create dataframe
a = [1.0, 2.0, 3.0, 1.0, 2.0]
b = [0.000000, 0.000000, 0.000000, 3.333333, 3.333333]
c = [np.array([[0., 1.], [0., 1.]]),
     np.array([[85., 1.2], [52., 0.]]),
     np.array([[5., 1.], [0., 0.]]),
     np.array([[0., 1.], [41., 0.]]),
     np.array([[85., 1.], [0., 21.]])]
df = pd.DataFrame({"a": a, "b": b, "c": c})

#### Save to csv and read back
df.to_csv("to_trash.csv")
df = pd.read_csv("to_trash.csv")

### Bad string manipulation that could be done better with regex
df["c"] = ("np.array(" + (df
           .c
           .str.split()
           .str.join(' ')
           .str.replace(" ", ",")
           .str.replace(",,", ",")
           .str.replace("[,", "[", regex=False)
           ) + ")").apply(lambda x: eval(x))
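As the comment in that snippet says, a regex can do the extraction more cleanly. A minimal sketch, assuming the same to_trash.csv file and 2x2 arrays as in the example above (parse_array and its fixed shape argument are just for illustration):
import re
import numpy as np
import pandas as pd

df = pd.read_csv("to_trash.csv", index_col=0)

def parse_array(text, shape=(2, 2)):
    # Pull every number out of the stringified array and rebuild it
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return np.array(numbers, dtype=float).reshape(shape)

df["c"] = df["c"].apply(parse_array)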
The best solution I found is using pickle files.
You can save your dataframe as a pickle file:
import cv2
import pandas as pd

img = cv2.imread('img1.jpg')
# Wrap the array in a list so the whole image is stored in a single cell
data = pd.DataFrame({'img': [img]})
data.to_pickle('dataset.pkl')
Then you can read it back as a pickle file:
import pickle

with open(ref_path + 'dataset.pkl', "rb") as openfile:
    df_file = pickle.load(openfile)
Let me know if it worked.
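For the dataframe from the question, a minimal sketch of the pickle round trip; pandas' own to_pickle/read_pickle keep the numpy arrays in column C intact (the file name arrays.pkl is just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0],
    "B": [0.0, 3.333333],
    "C": [np.array([[0., 1.], [0., 1.]]), np.array([[85., 1.], [52., 0.]])],
})

df.to_pickle("arrays.pkl")          # write
df2 = pd.read_pickle("arrays.pkl")  # read back
print(type(df2["C"].iloc[0]))       # <class 'numpy.ndarray'>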

Optimized way to find values with top 20 frequencies in spark dataframe

We have a Spark dataframe. We are trying to find the values with the top 20 frequencies in a column.
Ex) [1, 1, 1, 2, 2, 4]
In the above list,
1 occurs 3 times
2 occurs 2 times
4 occurs 1 time
We are computing this using pandas, and then creating a UDF in Spark and using it there.
This works for smaller datasets, but when the datasets are very tall (20M rows) we sometimes face memory issues.
from pyspark.sql import functions as F
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unit_testing_spark").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

def find_freq_values(col_list):
    if len(col_list) == 0:
        return []
    df = pd.DataFrame(col_list, columns=["value"])
    df = df[['value']].groupby(['value'])['value'] \
        .count() \
        .reset_index(name='count') \
        .sort_values(['count'], ascending=False) \
        .head(20)
    res = df.to_dict(orient='records')
    for curr_data in res:
        curr_value = curr_data["value"]
        ldt = str(type(curr_value)).lower()
        if "time" in ldt or "date" in ldt:
            curr_data["value"] = str(curr_value)
    return res

s = find_freq_values([1, 1, 1, 2, 2, 4])
print(s)
# Output: [{'value': 1, 'count': 3}, {'value': 2, 'count': 2}, {'value': 4, 'count': 1}]

column_data = ["col_1"]
column_header = tuple(column_data)
data = [[1], [1], [1], [2], [2], [4]]
df = spark.createDataFrame(data, column_header)

find_freq_udf = F.udf(find_freq_values, ArrayType(MapType(StringType(), StringType(), True)))
freq_res_df = df.select(*[find_freq_udf(F.collect_list(c)).alias(c) for c in df.columns])
freq_res = freq_res_df.collect()[0].asDict()
print(freq_res)
# Output: {'col_1': [{'count': '3', 'value': '1'}, {'count': '2', 'value': '2'}, {'count': '1', 'value': '4'}]}
Error message:
"An error occurred while calling o514.collectToPython.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 66.0 failed 10 times, most recent failure: Lost task 0.9 in stage 66.0 (TID 379) (w382f6d7a114442a8bd741d53661a2c3b-srpz3ldim3sr2-w-1.c.projectid.internal executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Container from a bad node: container_1658234225716_0001_01_000011 on host: w382f6d7a114442a8bd741d53661a2c3b-srpz3ldim3sr2-w-1.c.projectid.internal. Exit status: 143.
To process multiple columns in parallel we are using the select statement shown above (one collect_list per column).
How can this be optimized to avoid the memory issue?
One way would be to use only Spark to compute the frequencies. Here is one way to do it.
from pyspark.sql.functions import count, col

x = [1, 1, 1, 2, 2, 4]  # input data
df = spark.createDataFrame([(v,) for v in x], ['ID'])

df = df.groupBy('ID')
df = df.agg(count('ID').alias('id_count'))
# Sort descending so the most frequent values come first
df = df.orderBy(col('id_count').desc())
df = df.limit(20)
df.show()
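To cover the multi-column case from the question without collecting whole columns to the driver, you could run that aggregation once per column and only bring the top 20 rows per column back. A rough sketch, where wide_df stands for the question's multi-column Spark dataframe:
from pyspark.sql.functions import count, col

top_k = 20
freq_res = {}
for c in wide_df.columns:
    top_values = (wide_df.groupBy(c)
                         .agg(count(c).alias("count"))
                         .orderBy(col("count").desc())
                         .limit(top_k))
    # Only top_k rows per column ever reach the driver
    freq_res[c] = [row.asDict() for row in top_values.collect()]
print(freq_res)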

Adding tags to S3 objects using awswrangler?

I'm using awswrangler to write parquets to my S3 and I usually add tags on all my objects for access and cost control, but I couldn't find a way to do that directly with awswrangler. I'm currently using the code below to test:
import awswrangler as wr
import boto3
import pandas as pd
# Boto session
session = boto3.Session(profile_name='my_profile')
# Dummy pandas dataframe
d = {'col1': [1, 2], 'col2': [3, 4]}
df_pandas = pd.DataFrame(data=d)
wr.s3.to_parquet(df=df_pandas, path='s3://my-bucket/path/', boto3_session=session)
Is there a way to add tags to the objects that .to_parquet will write to my S3?
I just figured out that awswrangler has a parameter called s3_additional_kwargs that lets you pass additional arguments to the S3 requests awswrangler makes for you. You can send tags in the same format boto3 uses, e.g. 'Key1=value1&Key2=value2'.
Below is an example of how to add tags to your objects:
import awswrangler as wr
import boto3
import pandas as pd
# Tagging
tag_set = 'Key1=value1&Key2=value2'
# Boto session
session = boto3.Session(profile_name='my_profile')
# Dummy pandas dataframe
d = {'col1': [1, 2], 'col2': [3, 4]}
df_pandas = pd.DataFrame(data=d)
wr.s3.to_parquet(df=df_pandas, path='s3://my-bucket/path/', s3_additional_kwargs={'Tagging': tag_set}, boto3_session=session)
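To check that the tags actually landed on the written objects, one option is to list what to_parquet wrote and read the tags back with boto3 (a sketch; my-bucket and path are the placeholders from above):
# List the objects awswrangler just wrote, then read the tags on the first one
written_paths = wr.s3.list_objects('s3://my-bucket/path/', boto3_session=session)
s3_client = session.client('s3')
tags = s3_client.get_object_tagging(
    Bucket='my-bucket',
    Key=written_paths[0].replace('s3://my-bucket/', ''))
print(tags['TagSet'])  # e.g. [{'Key': 'Key1', 'Value': 'value1'}, ...]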

How to construct a "text/csv" payload when invoking a sagemaker endpoint

My training data looks like
df = pd.DataFrame({'A' : [2, 5], 'B' : [1, 7]})
I have trained a model in AWS Sagemaker and I deployed the model behind an endpoint.
The endpoint accepts the payload as "text/csv".
To invoke the endpoint using boto3 you can do:
import boto3

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName="my-sagemaker-endpoint-name",
    Body=my_payload_as_csv,
    ContentType='text/csv')
How do I construct the payload my_payload_as_csv from my dataframe in order to invoke the SageMaker endpoint correctly?
If you start from the dataframe example
df = pd.DataFrame({'A' : [2, 5], 'B' : [1, 7]})
you take a row
df_1_record = df[:1]
and convert df_1_record to a csv like this:
import io

csv_file = io.StringIO()
# by default sagemaker expects comma separated values
df_1_record.to_csv(csv_file, sep=",", header=False, index=False)
my_payload_as_csv = csv_file.getvalue()
my_payload_as_csv looks like
'2,1\n'
then you can invoke the SageMaker endpoint:
import boto3

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName="my-sagemaker-endpoint-name",
    Body=my_payload_as_csv,
    ContentType='text/csv')
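If you want to send several records in one request, a CSV payload is simply one record per line; whether the endpoint accepts batches depends on the serving container, so check its docs. A sketch for the whole example dataframe:
import io

csv_file = io.StringIO()
df.to_csv(csv_file, sep=",", header=False, index=False)
my_payload_as_csv = csv_file.getvalue()
# my_payload_as_csv now looks like '2,1\n5,7\n'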
@VincentCleas's answer is good. But if you want to construct the CSV payload without installing pandas, do this:
import boto3

csv_buffer = open('<file-name>.csv')
my_payload_as_csv = csv_buffer.read()

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName="my-sagemaker-endpoint-name",
    Body=my_payload_as_csv,
    ContentType='text/csv')
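In either case, the prediction comes back as a streaming body on the response; reading it might look like this (the exact output format depends on the model container):
# response is the return value of client.invoke_endpoint(...) above
result = response["Body"].read().decode("utf-8")
print(result)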

Pandas input function with multiple value sparse categorical data

Given a pandas dataframe
df = pd.DataFrame([
    [1, ["a", "b"], 10],
    [2, ["b"], 20],
], columns=["a", "b", "label"])
where column "b" is a list of values representing sparse categorical data, how can I create an input function to feed the estimator for train and predict?
Using pandas_input_fn does not work because of the b column:
train_fn = tf.estimator.inputs.pandas_input_fn(x=df[["a", "b"]], y=df.label, shuffle=True)
-- Error --
tensorflow.python.framework.errors_impl.InternalError: Unable to get element as bytes.
I can create a TFRecords file, write the data using a BytesList for column b, and read it using TFRecordDataset; then, with a parse function that parses column b using a VarLenFeature, it works.
But how can I feed this data into the estimator using an in-memory object/dataframe and/or pandas_input_fn?
Below is all my code:
import tensorflow as tf
import pandas as pd
from tensorflow.estimator.inputs import pandas_input_fn
from tensorflow.estimator import DNNRegressor
from tensorflow.feature_column import numeric_column, indicator_column, categorical_column_with_vocabulary_list
from tensorflow.train import Feature, Features, BytesList, FloatList, Example
from tensorflow.python_io import TFRecordWriter

df = pd.DataFrame([
    [1, ["a", "b"], 10],
    [2, ["b"], 20],
], columns=["a", "b", "label"])

writer = TFRecordWriter("test.tfrecord")
for row in df.iterrows():
    dict_feature = {}
    for e in row[1].iteritems():
        if e[0] == "a":
            dict_feature.update({e[0]: Feature(float_list=FloatList(value=[e[1]]))})
        elif e[0] == "b":
            dict_feature.update({e[0]: Feature(bytes_list=BytesList(value=[m.encode('utf-8') for m in e[1]]))})
        elif e[0] == "label":
            dict_feature.update({e[0]: Feature(float_list=FloatList(value=[e[1]]))})
    example = Example(features=Features(feature=dict_feature))
    writer.write(example.SerializeToString())
writer.close()

def parse_tfrecords(example_proto):
    feature_description = {}
    feature_description.update({"a": tf.FixedLenFeature(shape=[], dtype=tf.float32)})
    feature_description.update({"b": tf.VarLenFeature(dtype=tf.string)})
    feature_description.update({"label": tf.FixedLenFeature(shape=[], dtype=tf.float32)})
    parsed_features = tf.parse_single_example(example_proto, feature_description)
    features = {key: parsed_features[key] for key in ["a", "b"]}
    label = parsed_features["label"]
    return features, label

def tf_record_input_fn(filenames_pattern):
    def _input_fn():
        dataset = tf.data.TFRecordDataset(filenames=filenames_pattern)
        dataset = dataset.shuffle(buffer_size=128)
        dataset = dataset.map(map_func=parse_tfrecords)
        dataset = dataset.batch(batch_size=128)
        return dataset
    return _input_fn

feature_columns = [
    numeric_column("a"),
    indicator_column(categorical_column_with_vocabulary_list("b", vocabulary_list=['a', 'b']))
]
estimator = DNNRegressor(feature_columns=feature_columns, hidden_units=[1])

train_input_fn = tf_record_input_fn("test.tfrecord")
# Next line does not work
# train_input_fn = tf.estimator.inputs.pandas_input_fn(x=df[["a", "b"]], y=df.label, shuffle=True)
estimator.train(train_input_fn)
I do not have a complete solution for your query because of my lack of experience with the tensorflow.estimator API, but is it possible for you to reshape your dataframe instead? If the values in the lists of column b are categorical in nature, you could try one-hot encoding them, adding more columns to df in the process (see the sketch below). That way, your df becomes processable by estimators in general.
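A minimal sketch of that reshaping with pandas only: explode the list column, dummy-encode it, and collapse back to one row per original record (the b_a/b_b column names are just what get_dummies produces here):
import pandas as pd

df = pd.DataFrame([
    [1, ["a", "b"], 10],
    [2, ["b"], 20],
], columns=["a", "b", "label"])

# One row per list element, then 0/1 indicator columns, then back to one row per record
one_hot = pd.get_dummies(df["b"].explode(), prefix="b").groupby(level=0).max().astype(int)
df_wide = pd.concat([df.drop(columns="b"), one_hot], axis=1)
print(df_wide)
#    a  label  b_a  b_b
# 0  1     10    1    1
# 1  2     20    0    1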