Invoke SageMaker Endpoint using Spark (EMR Cluster)

I am developing a Spark application on an EMR cluster. The flow of the project goes like this:
The DataFrame is repartitioned based on an ID column.
A SageMaker endpoint needs to be invoked on each partition to get the result.
But when I do that, I get this error:
cPickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects
The code is as follows:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
import itertools
import json
import boto3
import time
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from io import BytesIO as StringIO

client = boto3.client('sagemaker-runtime')

def invoke_endpoint(json_data):
    ansJson = json.dumps(json_data)
    response = client.invoke_endpoint(EndpointName="<EndpointName>", Body=ansJson, ContentType='text/csv', Accept='Accept')
    resultJson = json.loads(str(response['Body'].read().decode('ascii')))
    return resultJson
def execute(list_of_url):
    final_iterator = []
    urlist = []
    json_data = {}
    for url in list_of_url:
        final_iterator.append((url.ID, url.Prediction))
        urlist.append(url.ID)
    json_data['URL'] = urlist
    resultjson = invoke_endpoint(json_data)
    return iter(final_iterator)
### Attributes to be added to the Spark conf
conf = (SparkConf()
        .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
        .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true"))
scT = SparkContext(conf=conf)
scT.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = scT._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "<AccessKeyId>")
hadoopConf.set("fs.s3a.secret.key", "<SecretAccessKey>")
hadoopConf.set("fs.s3a.endpoint", "s3-us-east-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3.enableV4", "true")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sql = SparkSession(scT)
csv_df = sql.read.csv('s3 path to my csv file', header=True)
# print('Total count is', csv_df.count())
csv_dup_df = csv_df.dropDuplicates(['ID'])
print('Total count is', csv_dup_df.count())
windowSpec = Window.orderBy("ID")
result_df = csv_dup_df.withColumn("ImageID", F.row_number().over(windowSpec) % 80)
final_df = result_df.withColumn("Prediction", lit("UNKNOWN"))
df2 = final_df.repartition("ImageID")
df3 = df2.rdd.mapPartitions(lambda url: execute(url)).toDF()
df3.coalesce(1).write.mode("overwrite").save("s3 path to save the results in csv format", format="csv")
print(df3.rdd.glom().collect())
print("Work is Done")
Can you tell me how to rectify this issue?
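One common cause of this error: the boto3 client created at module level on the driver gets captured in the closure passed to mapPartitions, and Spark then tries to pickle it (boto3 clients hold thread locks, which cannot be pickled). A minimal sketch of that fix, assuming the same <EndpointName> placeholder and an explicit region: build the client inside the function so each executor constructs its own.
import json
import boto3

def execute(list_of_url):
    # Build the client on the executor instead of capturing the driver-side
    # client in the closure (boto3 clients hold unpicklable thread locks)
    client = boto3.client('sagemaker-runtime', region_name='us-east-1')  # region is an assumption
    final_iterator = []
    urlist = []
    for url in list_of_url:
        final_iterator.append((url.ID, url.Prediction))
        urlist.append(url.ID)
    response = client.invoke_endpoint(EndpointName="<EndpointName>",
                                      Body=json.dumps({'URL': urlist}),
                                      ContentType='text/csv')
    result = json.loads(response['Body'].read().decode('ascii'))
    return iter(final_iterator)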

Related

Spark Dataframe .show() output not aligned

I have Spark installed locally and am running it in a Jupyter notebook in VSCode.
I am using this test code to create a small DataFrame and display it in the console with .show(), but the output is not aligned:
# %%
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local").appName("my-application-name").getOrCreate()
)
sc = spark.sparkContext
spark.conf.set("spark.sql.shuffle.partitions", "5")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
columns = ["language", "users_count"]
data = [
    ("Java", "20000"),
    ("Python", "100000"),
    ("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show(truncate=False)
Converting to pandas and printing shows similar misalignment:
df_pd = df.toPandas()
print(df_pd)
Can you help me figure out where to look to fix this?
Thanks
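Misalignment like this usually comes from the notebook's output font rather than from Spark itself, so one thing worth checking is whether the VSCode notebook output is rendered in a monospace font. As a workaround sketch (standard PySpark API, nothing assumed beyond the df above), printing rows vertically sidesteps column alignment entirely:
# Print each row as its own block so alignment no longer depends on column widths
df.show(truncate=False, vertical=True)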

How to create multiple pandas profiling reports for multiple csv files in a directory? The report name should match the file name

I tried this:
import glob
import os
import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport

files = glob.glob("D:\home_health_services_current_data\*.csv")
df = pd.DataFrame()
for f in files:
    csv = pd.read_csv(f)
    df = df.append(csv)
profile = ProfileReport(df, title="Profiling Report", explorative=True)
profile.to_file("D:\proj_report\profilerep\prof_report.html")
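The snippet above appends every CSV into a single DataFrame and writes one combined report. A minimal sketch of the per-file version, assuming the same directory paths as above: build one ProfileReport per iteration and derive the report name from the CSV name.
import glob
import os
import pandas as pd
from pandas_profiling import ProfileReport

files = glob.glob(r"D:\home_health_services_current_data\*.csv")
for f in files:
    df = pd.read_csv(f)
    # Name the report after the CSV file, e.g. foo.csv -> foo.html
    name = os.path.splitext(os.path.basename(f))[0]
    profile = ProfileReport(df, title=f"Profiling Report - {name}", explorative=True)
    profile.to_file(rf"D:\proj_report\profilerep\{name}.html")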

Accessing methods within a class from bokeh FileInput widget

I am working on a Bokeh serve UI and am running into trouble interfacing a class (and its methods) with the FileInput widget. I am using a class (in this example called "EIS_data") which, when instantiated, loads a file using pd.read_csv. The EIS_data class also has a method to plot the data in a particular way, and I'd like to be able to load the pandas dataframe and then plot and manipulate the data using the methods already in place in the class.
So far, I have been able to load the data successfully using the FileInput widget, but I can't figure out how to access the dataframe again once it's loaded. In a standalone Jupyter notebook, I could run d = EIS_data("filename") and then d.plot() to load the data into a pandas dataframe and plot it according to the method defined in the EIS_data class, but I can't figure out how to replicate this in the UI code once the data are loaded using the FileInput widget.
Is there a way I can interface this with Bokeh widgets, such that I could simply add d.plot() to curdoc()? I have found a workaround using ColumnDataSource, but it seems a shame to redefine plotting methods and data handling when they are already defined in the class. Below are minimal working examples of the UI code and the class definition.
UI Code:
import numpy as np
import pandas as pd
from eis_analysis_trimmed import EIS_data
import bokeh
from bokeh.io import curdoc
from bokeh import layouts
from bokeh.layouts import column, row, gridplot
from bokeh.plotting import figure
from bokeh.models import *
import base64
import io

## Instantiate the EIS_data class for loading data
def load_data(f):
    return EIS_data(f)

## Updater function called to load data with the FileInput widget.
## The payload must be decoded using base64.
def load_file(attr, old, new):
    decoded = base64.b64decode(new)
    d = io.BytesIO(decoded)
    dat = load_data(d)
    print(dat.df)
    print(dat)
    print("EIS Data Uploaded Successfully")
    return dat

f_load = Paragraph(text="""Load Data""", height=15)
f = FileInput()
f.on_change('value', load_file)
curdoc().add_root(column(f))
and here is the EIS_data class:
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
from bokeh.plotting import figure, show
from bokeh.models import LinearAxis, Range1d
from bokeh.resources import INLINE
import bokeh.io

# Locally include javascript dependencies in html
bokeh.io.output_notebook(INLINE)

class EIS_data:
    def __init__(self, file_name, delimiter='\t',
                 header=0, f_low=None, f_high=None):
        # Load EIS data into a pandas dataframe
        eis_data = pd.read_csv(file_name, delimiter=delimiter, header=header)
        # Iterate through all of the columns and drop any column
        # whose values are all null
        for c in eis_data.columns:
            if eis_data[c].isnull().all():
                eis_data = eis_data.drop([c], axis=1)
        # Make sure that the data are imported as floats and not strings
        eis_data = eis_data[['freq/Hz', 'Re(Z)/Ohm', '-Im(Z)/Ohm']]
        eis_data['freq/Hz'] = pd.to_numeric(eis_data['freq/Hz'])
        eis_data['Re(Z)/Ohm'] = pd.to_numeric(eis_data['Re(Z)/Ohm'])
        eis_data['-Im(Z)/Ohm'] = pd.to_numeric(eis_data['-Im(Z)/Ohm'])
        self.df = eis_data.sort_values(by='freq/Hz')

    def plot(self, fit_vals=None):
        plot = figure(title="Nyquist Plot",
                      x_axis_label='Re(Z) Ohm',
                      y_axis_label='-Im(Z) Ohm',
                      plot_width=600,
                      plot_height=600)
        plot.circle(self.df['Re(Z)/Ohm'], self.df['-Im(Z)/Ohm'],
                    size=7, color='navy', name='Data')
        return plot
EDIT: Adding the workaround using ColumnDataSource
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.plotting import figure
from bokeh.models import *
from bokeh.models.widgets import FileInput
import base64
import io
from eis_analysis2 import EIS_data

# Instantiate the EIS_data class for loading data
def load_data(data):
    return EIS_data(data)

# Updater function called to load data with the FileInput widget.
# The payload must be decoded using base64.
def load_file(attr, old, new):
    decoded = base64.b64decode(new)
    d = io.BytesIO(decoded)
    dat = load_data(d)
    dat_df = dat.df
    # Replace plot data with data from the newly-loaded file
    source.data = dict(freq=dat_df[dat_df.columns[0]],
                       reZ=dat_df[dat_df.columns[1]],
                       imZ=dat_df[dat_df.columns[2]])
    # phase, mag = bode_calc(reZ, imZ)
    print(dat_df)
    print("EIS Data Uploaded Successfully")

# Create the ColumnDataSource that will be used by the plot
source = ColumnDataSource(data=dict(freq=[], reZ=[], imZ=[]))

## Make the Nyquist plot
nyq_plot = figure(title="Nyquist Plot",
                  x_axis_label='Re(Z) Ohm',
                  y_axis_label='-Im(Z) Ohm',
                  plot_width=600,
                  plot_height=600)
nyq_plot.circle(x="reZ", y="imZ", source=source, size=7, color='navy', name='Data')

f = FileInput()
f.on_change('value', load_file)
layout = column(f, nyq_plot)
curdoc().add_root(layout)
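To address the original question more directly: one pattern that avoids redefining the plotting logic is to keep a placeholder figure in a layout container and swap it for the figure returned by EIS_data.plot() when a file is loaded. A rough sketch of that idea, reusing the names from above (untested against the real data files):
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import FileInput
from bokeh.plotting import figure
import base64
import io
from eis_analysis2 import EIS_data

def load_file(attr, old, new):
    decoded = base64.b64decode(new)
    dat = EIS_data(io.BytesIO(decoded))
    # Swap the placeholder child for the figure built by the class method;
    # curdoc picks up the change on the next tick
    layout.children[1] = dat.plot()

f = FileInput()
f.on_change('value', load_file)
placeholder = figure(title="Nyquist Plot")  # empty figure until a file is loaded
layout = column(f, placeholder)
curdoc().add_root(layout)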

To read a CSV in Google Colaboratory

from google.colab import files
uploaded = files.upload()
import io

def ls(ruta=uploaded):
    return [arch.name for arch in io.StringIO((ruta)) if arch.is_file()]

divisas = ls()
I have this error:
TypeError: initial_value must be str or None, not dict
from google.colab import files
uploaded = files.upload()
Import the google.colab files library for file upload, then upload the file and pass the file name to the pandas read_csv function:
import io
import pandas as pd
df2 = pd.read_csv(io.BytesIO(uploaded['heart.csv']))
df2.head()
I need to include the file names in the divisas list.
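Since files.upload() returns a dict that maps each uploaded file name to its contents, the TypeError comes from passing that dict to io.StringIO. Under that reading, the file names can be taken straight from the dict keys; a minimal sketch:
from google.colab import files

uploaded = files.upload()
# files.upload() returns {filename: file_contents}, so the
# uploaded names are just the dict keys
divisas = list(uploaded.keys())
print(divisas)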

Zipline - How to pass bundle DataPortal to TradeAlgorithm.run()?

I am trying to run a Zipline back test by calling the run() method of zipline.algorithm.TradingAlgorithm:
algo = TradingAlgorithm(initialize=CandlestickStrategy.initialize,
                        handle_data=CandlestickStrategy.handle_data,
                        analyze=CandlestickStrategy.analyze,
                        data=None,
                        bundle='quandl')
results = algo.run()
But I'm not sure what or how to pass the data parameter. I have already ingested the data bundle which is called 'quandl'. According to the docs, that parameter should receive a DataPortal instance, but I don't know how to create one of those based on the data I have ingested. What is the best way of doing this/is this necessary?
Essentially my goal is to create a top level 'dashboard' style class which can run multiple back tests using different strategies which exist in separate modules.
Full code (dashboard.py):
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from mpl_finance import candlestick_ohlc
from datetime import datetime, date, tzinfo, timedelta
from dateutil import parser
import pytz
import numpy as np
import talib
import warnings
import logbook
from logbook import Logger

log = Logger('Algorithm')

from zipline.algorithm import TradingAlgorithm
from zipline.api import order_target_percent, order_target, cancel_order, get_open_orders, get_order, get_datetime, record, symbol
from zipline.data import bundles
from zipline.finance import execution
from CandlestickStrategy import CandlestickStrategy

warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

# Choosing a security and a time horizon
logbook.StderrHandler().push_application()
start = datetime(2014, 9, 1, 0, 0, 0, 0, pytz.utc)
end = datetime(2016, 1, 1, 0, 0, 0, 0, pytz.utc)

#dataPortal = data_portal.DataPortal(asset_finder, trading_calendar, first_trading_day, e
#bundle = bundles.load('quandl',None,start)
algo = TradingAlgorithm(initialize=CandlestickStrategy.initialize,
                        handle_data=CandlestickStrategy.handle_data,
                        analyze=CandlestickStrategy.analyze,
                        data=None,
                        bundle='quandl')
results = algo.run()
CandlestickStrategy.py:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from mpl_finance import candlestick_ohlc
from zipline.api import order_target_percent, order_target, cancel_order, get_open_orders, get_order, get_datetime, record, symbol
from zipline.finance import execution
from datetime import datetime, date, tzinfo, timedelta
from dateutil import parser
import pytz
import numpy as np
import talib
import warnings

warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

class CandlestickStrategy:
    def initialize(context):
        print("initializing algorithm...")
        context.i = 0
        context.asset = symbol('AAL')

    def handle_data(context, data):
        try:
            trailing_window = data.history(context.asset, ['open', 'high', 'low', 'close'], 28, '1d')
        except:
            return

    def analyze(context=None, results=None):
        print("Analyze")
Hopefully someone can point me in the right direction.
Thanks
I faced the same issue. When running the trading algorithm manually this way, the bundle argument is not evaluated; you need to create the data portal yourself. I manually registered the bundle and created a DataPortal to run it:
# Imports assumed for this snippet (zipline 1.x)
import pandas as pd
from zipline.data import bundles
from zipline.data.bundles.csvdir import csvdir_equities
from zipline.data.data_portal import DataPortal
from zipline.utils.calendars import get_calendar

bundles.register('yahoo-xetra',
                 csvdir_equities(get_calendar("XETRA"), ["daily"],
                                 '/data/yahoo'),
                 calendar_name='XETRA')
bundle_data = bundles.load(
    'yahoo-xetra',
)
first_trading_day = bundle_data.equity_daily_bar_reader.first_trading_day
data = DataPortal(
    bundle_data.asset_finder,
    trading_calendar=get_calendar("XETRA"),
    first_trading_day=first_trading_day,
    equity_minute_reader=bundle_data.equity_minute_bar_reader,
    equity_daily_reader=bundle_data.equity_daily_bar_reader,
    adjustment_reader=bundle_data.adjustment_reader,
)

# SimpleAlgorithm is the answerer's own TradingAlgorithm subclass
Strategy = SimpleAlgorithm(trading_calendar=get_calendar("XETRA"), data_frequency='daily',
                           start=pd.Timestamp('2017-1-1 08:00:00+0200', tz='Europe/Berlin'),
                           end=pd.Timestamp('2018-12-27 08:00:00+0200', tz='Europe/Berlin'),
                           capital_base=10000000,
                           data_portal=data)
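If SimpleAlgorithm is a TradingAlgorithm subclass, the back test can then presumably be started with run(); a one-line sketch under that assumption (in zipline 1.x, run() can fall back to the data portal supplied at construction):
# Assumes SimpleAlgorithm inherits TradingAlgorithm.run()
results = Strategy.run()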