How to convert a Pydantic model in FastAPI to a Pandas DataFrame?

I am trying to convert a Pydantic model to a Pandas DataFrame, but I am getting various errors.
Here is the code:
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import sklearn
import pandas as pd
import numpy as np

class Userdata(BaseModel):
    current_res_month_dec: Optional[int] = 0
    current_res_month_nov: Optional[int] = 0

async def return_recurrent_user_predictions_gb(user_data: Userdata):
    empty_dataframe = pd.DataFrame([Userdata(**{
        'current_res_month_dec': user_data.current_res_month_dec,
        'current_res_month_nov': user_data.current_res_month_nov})], ignore_index=True)
This is the DataFrame that is returned when trying to execute it through /docs in my local environment:
Response body:
{
  "0": {
    "0": [
      "current_res_month_dec",
      0
    ]
  },
  "1": {
    "0": [
      "current_res_month_nov",
      0
    ]
  }
}
but if I try to use this DataFrame for a prediction:
model_has_afternoon = pickle.load(open('./models/model_gbclf_prob_current_product_has_afternoon.pickle', 'rb'))
result_afternoon = model_has_afternoon.predict_proba(empty_dataframe)[:, 1]
I get this error:
ValueError: setting an array element with a sequence.
I have built my own DataFrame before, and the predictions do work when given such a DataFrame.

You first need to convert the Pydantic model into a dictionary using Pydantic's dict() method. Other approaches, such as calling Python's dict() on the model or accessing its .__dict__ attribute, have been reported to be faster alternatives (see this answer); however, since you are already working with a Pydantic model, its own dict() method is the most robust choice. Then pass the dictionary to pandas.DataFrame() wrapped in square brackets, for example pd.DataFrame([data.dict()]). As described in this answer, this approach makes the keys of the passed dict the columns and its values the single row. If you need a different orientation, you can use pandas.DataFrame.from_dict() instead.
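As a quick sketch of the two orientations (field names taken from your model):

import pandas as pd

d = {'current_res_month_dec': 0, 'current_res_month_nov': 0}

# keys become the columns, values become a single row
df = pd.DataFrame([d])

# keys become the row index instead
df_t = pd.DataFrame.from_dict(d, orient='index')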
Working Example
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel
import pandas as pd

app = FastAPI()

class Userdata(BaseModel):
    col1: Optional[int] = 0
    col2: Optional[int] = 0
    col3: str = "foo"

@app.post('/submit')
def submit_data(data: Userdata):
    df = pd.DataFrame([data.dict()])
    return "Success"
More Options
As you mentioned that you would like to use the DataFrame for Machine Learning predictions, note that there are a few other ways to pass the data to the predict() and predict_proba() functions that do not require creating a DataFrame at all. These options include:
model.predict([[data.col1, data.col2, data.col3]])
and
model.predict([list(data.dict().values())])
Please have a look at this answer for more details. In case you also need to return a DataFrame to the client in JSON format, please take a look here.
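For completeness, a minimal sketch of one way to return a DataFrame as JSON, using to_dict(orient='records') (other orientations work as well):

from fastapi import FastAPI
import pandas as pd

app = FastAPI()

@app.get('/data')
def get_data():
    df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
    # FastAPI serializes plain Python dicts/lists to JSON automatically
    return df.to_dict(orient='records')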

Related

How do you add dataclasses as valid index values to a plotly chart?

I am trying to switch from the matplotlib pandas plotting backend to plotly. However, I am being held back by a common occurrence of this error:
TypeError: Object of type Quarter is not JSON serializable
Where Quarter is a dataclass in my codebase.
For a minimal example, consider:
from dataclasses import dataclass
import pandas as pd

@dataclass
class Foo:
    val: int

df = pd.DataFrame({'x': [Foo(i) for i in range(10)], 'y': list(range(10))})
df.plot.scatter(x='x', y='y')
As expected, the above returns:
TypeError: Object of type Foo is not JSON serializable
Now, I don't expect plotly to be magical, but adding a __float__ magic method allows the Foo objects to be used with the matplotlib backend:
# This works with the matplotlib backend
from dataclasses import dataclass
import pandas as pd

@dataclass
class Foo:
    val: int
    def __float__(self):
        return float(self.val)

df = pd.DataFrame({'x': [Foo(i) for i in range(10)], 'y': list(range(10))})
df.plot.scatter(x='x', y='y')
How can I update my dataclass to allow for it to be used with the plotly backend?
You can get pandas to cast to float before invoking the plotting backend.
from dataclasses import dataclass
import pandas as pd

@dataclass
class Foo:
    val: int
    def __float__(self):
        return float(self.val)

df = pd.DataFrame({'x': [Foo(i) for i in range(10)], 'y': list(range(10))})
df["x"].astype(float)  # note: astype returns a copy, so this line alone changes nothing

pd.options.plotting.backend = "plotly"
df.assign(x=lambda d: d["x"].astype(float)).plot.scatter(x='x', y='y')
monkey patching
if you don't want to change your code, you can monkey patch the plotly implementation of the pandas plotting API:
https://pandas.pydata.org/pandas-docs/stable/development/extending.html#plotting-backends
from dataclasses import dataclass
import pandas as pd
import wrapt, json
import plotly

@wrapt.patch_function_wrapper(plotly, 'plot')
def new_plot(wrapped, instance, args, kwargs):
    # if the x column is not JSON serializable, cast it to float before plotting
    try:
        json.dumps(args[0][kwargs["x"]])
    except TypeError:
        args[0][kwargs["x"]] = args[0][kwargs["x"]].astype(float)
    return wrapped(*args, **kwargs)

@dataclass
class Foo:
    val: int
    def __float__(self):
        return float(self.val)

df = pd.DataFrame({'x': [Foo(i) for i in range(10)], 'y': list(range(10))})

pd.options.plotting.backend = "plotly"
df.plot.scatter(x='x', y='y')

Plotly chart percentage with smileys

I would like to add a plot figure based on smileys, like this one:
The data will come from a pandas DataFrame: dataframe.value_counts(normalize=True)
Can someone give me some clues?
use colorscale in the normal way for a heatmap
use annotation_text to assign an emoji to a value
import plotly.figure_factory as ff
import plotly.graph_objects as go
import pandas as pd
import numpy as np

df = pd.DataFrame([[j*10 + i for i in range(10)] for j in range(10)])
e = ["😃", "🙂", "😐", "☹ī¸"]

fig = go.Figure(ff.create_annotated_heatmap(
    z=df.values, colorscale="rdylgn", reversescale=False,
    annotation_text=np.select([df.values > 75, df.values > 50, df.values > 25, df.values >= 0], e),
))
fig.update_annotations(font_size=25)
# allows emoji to use background color
fig.update_annotations(opacity=0.7)
update coloured emoji
fundamentally, you need emoticons that can accept colour styling
for this I switched to Font Awesome; this then also requires switching to Dash, plotly's cousin, so that external CSS can be used (to load FA)
then build a Dash HTML table, applying styling logic to pick the emoticon and the colour
from jupyter_dash import JupyterDash
import dash_html_components as html
import pandas as pd
import branca.colormap

# Load Data
df = pd.DataFrame([[j*10 + i for i in range(10)] for j in range(10)])

external_stylesheets = [{
    'href': 'https://use.fontawesome.com/releases/v5.8.1/css/all.css',
    'rel': 'stylesheet', 'crossorigin': 'anonymous',
    'integrity': 'sha384-50oBUHEmvpQ+1lW4y57PTFmhCaXp0ML5d60M1M7uH2+nqUivzIebhndOJK28anvf',
}]

# possibly could use a different library for this - simple way to map a value to a colormap
cm = branca.colormap.LinearColormap(["red", "yellow", "green"], vmin=0, vmax=100, caption=None)

def mysmiley(v):
    sm = ["far fa-grin", "far fa-smile", "far fa-meh", "far fa-frown"]
    return html.Span(className=sm[3 - (v//25)], style={"color": cm(v), "font-size": "2em"})

# Build App
app = JupyterDash(__name__, external_stylesheets=external_stylesheets)
app.layout = html.Div([
    html.Table([html.Tr([html.Td(mysmiley(c)) for c in r]) for r in df.values])
])

# Run app and display result inline in the notebook
app.run_server(mode='inline')

Accessing methods within a class from bokeh FileInput widget

I am working on a Bokeh serve UI and am running into trouble interfacing a class (and its methods) with the FileInput widget. I am using a class (in this example, called "EIS_data") which, when instantiated, loads a file using pd.read_csv. The EIS_data class also has a method to plot the data in a particular way, and I'd like to be able to load the pandas dataframe and call and manipulate the data using the methods already in place in the class.
So far, I have been able to load the data successfully using the FileInput widget, but I can't figure out how to access the dataframe again once it's loaded in. In a standalone Jupyter notebook, I could run d = EIS_data("filename") and then d.plot() to load the data into a pandas dataframe and plot it using the method defined in the EIS_data class, but I can't figure out how to replicate this in the UI code once the data are loaded using the FileInput widget.
Is there a way I can interface this with Bokeh widgets, such that I could simply add d.plot() to curdoc()? I have found a workaround using ColumnDataSource, but it seems a shame to redefine plotting methods and data handling when they are already defined in the class. Below are minimal working examples of the UI code and the class definition.
UI Code:
import numpy as np
import pandas as pd
from eis_analysis_trimmed import EIS_data
import bokeh
from bokeh.io import curdoc
from bokeh import layouts
from bokeh.layouts import column, row, gridplot
from bokeh.plotting import figure
from bokeh.models import *
import base64
import io

## Instantiate the EIS_data class for loading data
def load_data(f):
    return EIS_data(f)

## updater function called to load data with FileInput widget
## Must be decoded using base64
def load_file(attr, old, new):
    decoded = base64.b64decode(new)
    d = io.BytesIO(decoded)
    dat = load_data(d)
    print(dat.df)
    print(dat)
    print("EIS Data Uploaded Successfully")
    return dat

f_load = Paragraph(text="""Load Data""", height=15)
f = FileInput()
f.on_change('value', load_file)

curdoc().add_root(column(f))
and here is the EIS_data class:
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
from bokeh.plotting import figure, show
from bokeh.models import LinearAxis, Range1d
from bokeh.resources import INLINE
import bokeh.io

# locally include javascript dependencies in html
bokeh.io.output_notebook(INLINE)

class EIS_data:
    def __init__(self, file_name, delimiter='\t',
                 header=0, f_low=None, f_high=None):
        # load eis data into a pandas dataframe
        eis_data = pd.read_csv(file_name, delimiter=delimiter, header=header)
        # iterate through all of the columns and check to see
        # if all of the values in that column are null
        # if they are, then remove that column
        for c in eis_data.columns:
            if eis_data[c].isnull().all():
                eis_data = eis_data.drop([c], axis=1)
        # make sure that the data are imported as floats and not strings
        eis_data = eis_data[['freq/Hz', 'Re(Z)/Ohm', '-Im(Z)/Ohm']]
        eis_data['freq/Hz'] = pd.to_numeric(eis_data['freq/Hz'])
        eis_data['Re(Z)/Ohm'] = pd.to_numeric(eis_data['Re(Z)/Ohm'])
        eis_data['-Im(Z)/Ohm'] = pd.to_numeric(eis_data['-Im(Z)/Ohm'])
        self.df = eis_data.sort_values(by='freq/Hz')

    def plot(self, fit_vals=None):
        plot = figure(title="Nyquist Plot",
                      x_axis_label='Re(Z) Ohm',
                      y_axis_label='-Im(Z) Ohm',
                      plot_width=600,
                      plot_height=600)
        plot.circle(self.df['Re(Z)/Ohm'], self.df['-Im(Z)/Ohm'],
                    size=7, color='navy', name='Data')
        return plot
EDIT: Adding the workaround using ColumnDataSource
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.plotting import figure
from bokeh.models import *
from bokeh.models.widgets import FileInput
import base64
import io
from eis_analysis2 import EIS_data

# Instantiate the EIS_data class for loading data
def load_data(data):
    return EIS_data(data)

# updater function called to load data with FileInput widget
# Must be decoded using base64
def load_file(attr, old, new):
    decoded = base64.b64decode(new)
    d = io.BytesIO(decoded)
    dat = load_data(d)
    dat_df = dat.df
    # Replace plot data with data from newly-loaded file
    source.data = dict(freq=dat_df[dat_df.columns[0]],
                       reZ=dat_df[dat_df.columns[1]],
                       imZ=dat_df[dat_df.columns[2]])
    #phase,mag = bode_calc(reZ,imZ)
    print(dat_df)
    print("EIS Data Uploaded Successfully")

# Create Column Data Source that will be used by the plot
source = ColumnDataSource(data=dict(freq=[], reZ=[], imZ=[]))

## Make the nyquist plot
nyq_plot = figure(title="Nyquist Plot",
                  x_axis_label='Re(Z) Ohm',
                  y_axis_label='-Im(Z) Ohm',
                  plot_width=600,
                  plot_height=600)
nyq_plot.circle(x="reZ", y="imZ", source=source, size=7, color='navy', name='Data')

f = FileInput()
f.on_change('value', load_file)

layout = column(f, nyq_plot)
curdoc().add_root(layout)
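For what it's worth, a sketch of one way to keep using the class's own plot() method without redefining the plotting logic: add a placeholder figure to the layout and replace it with the figure returned by EIS_data.plot() inside the callback, since Bokeh layouts allow swapping their children at runtime.

from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.plotting import figure
from bokeh.models.widgets import FileInput
import base64
import io
from eis_analysis2 import EIS_data

def load_file(attr, old, new):
    decoded = base64.b64decode(new)
    d = EIS_data(io.BytesIO(decoded))
    # swap the placeholder (second child of the column) for the class's plot
    layout.children[1] = d.plot()

placeholder = figure(title="Nyquist Plot", plot_width=600, plot_height=600)

f = FileInput()
f.on_change('value', load_file)

layout = column(f, placeholder)
curdoc().add_root(layout)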

When using Pandas to_hdf is it possible to specify a column data type to vlen special_dtype / vlarray for ragged tensors?

I have a Pandas column which contains numpy arrays or lists of varying size. If I try to convert the dataframe to hdf5 using to_hdf, I get the following message:
PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values]
I am guessing this is because of the ragged tensors in the pandas column. h5py does have a special datatype for ragged tensors:
http://docs.h5py.org/en/stable/special.html#arbitrary-vlen-data
Example here
h5f = h5py.File('data.h5', 'w')
dt = h5py.special_dtype(vlen=np.dtype('int32'))
h5f.create_dataset('batch', data=yourData, dtype=dt, compression='gzip', compression_opts=9)
So I can convert the pandas df to numpy, and then save each numpy array separately, with the varying length column stored with the special vlen datatype.
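A minimal sketch of that manual h5py route, assuming the ragged column is the totalCites2 column from the example below:

import h5py
import numpy as np

dt = h5py.special_dtype(vlen=np.dtype('int32'))
with h5py.File('ragged.h5', 'w') as h5f:
    # each element of the list is a variable-length int32 array
    h5f.create_dataset('totalCites2',
                       data=[np.asarray(x, dtype='int32') for x in sampleDF2['totalCites2']],
                       dtype=dt, compression='gzip', compression_opts=9)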
I am wondering if there is a way to do this in Pandas.
The following is a minimal example using a small chunk of my data. It downloads and opens a small chunk of the dataframe, and saves it to hdf5
import requests
import pickle
import numpy as np
import pandas as pd

# Download function for google drive
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

# download the google drive file
download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')
sampleDF2 = pd.read_pickle('sample.pkl')

sampleDF2.to_hdf('pandasList.hdf', 'first', complevel=9)
sampleDF2['totalCites2'] = sampleDF2['totalCites2'].apply(lambda x: np.array(x))
sampleDF2.to_hdf('pandasNumpy.hdf', 'first', complevel=9)
For convenience, here is a colab notebook which has this code
https://colab.research.google.com/drive/1DjiPsN3MbRWP6NnJwvaAhzy66FNbPVA8
Edit:
As hpualj mentioned, Pandas uses PyTables, not h5py, so it looks like the question should be how to use VLArray, which is how PyTables stores variable-length arrays.
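A minimal sketch of writing the ragged column with PyTables' VLArray directly (this sidesteps to_hdf rather than extending it; the column name is taken from the example above):

import tables as tb
import numpy as np

with tb.open_file('ragged.h5', 'w') as f:
    # a VLArray holds rows of varying length, one per append() call
    vl = f.create_vlarray(f.root, 'totalCites2', tb.Int32Atom(),
                          "variable-length rows of totalCites2")
    for row in sampleDF2['totalCites2']:
        vl.append(np.asarray(row, dtype='int32'))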

Python- Exporting a Dataframe into a csv

I'm trying to write a dataframe to a csv using pandas, but I'm getting the following error: AttributeError: 'list' object has no attribute 'to_csv'. I believe I'm writing the syntax correctly, but could anyone point out where my syntax is incorrect in trying to write a dataframe to a csv?
This is the link to the file: https://s22.q4cdn.com/351912490/files/doc_financials/quarter_spanish/2018/2018.02.25_Release-4Q18_ingl%C3%A9s.pdf
Thanks for your time!
import tabula
from tabula import read_pdf
import pandas as pd
from pandas import read_json, read_csv

a = read_pdf(r"C:\Users\Emege\Desktop\micro 1 true\earnings_release.pdf",
             multiple_tables=True, pages=15, output_format="csv")
print(a)
a.to_csv("a.csv", header=False, index=False, encoding="utf-8")
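The likely cause: with multiple_tables=True, read_pdf returns a list of DataFrames rather than a single DataFrame, so to_csv must be called on one of its elements. A minimal sketch:

from tabula import read_pdf

tables = read_pdf(r"C:\Users\Emege\Desktop\micro 1 true\earnings_release.pdf",
                  multiple_tables=True, pages=15)
# read_pdf returned a list; write the first extracted table to csv
tables[0].to_csv("a.csv", header=False, index=False, encoding="utf-8")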