When using Pandas to_hdf, is it possible to specify a column data type as vlen special_dtype / vlarray for ragged tensors?

I have a Pandas column which contains numpy arrays or lists of varying size. If I try to convert the dataframe to HDF5 using to_hdf, I get a message that says
PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values]
I am guessing this is because of the ragged tensors in the pandas column. h5py does have a special datatype for ragged (variable-length) data:
http://docs.h5py.org/en/stable/special.html#arbitrary-vlen-data
Here is an example:
h5f = h5py.File('data.h5', 'w')
# vlen special dtype: each element of the dataset is a variable-length int32 array
dt = h5py.special_dtype(vlen=np.dtype('int32'))
h5f.create_dataset('batch', data=yourData, dtype=dt, compression='gzip', compression_opts=9)
So I could convert the pandas df to numpy and then save each numpy array separately, with the varying-length column stored using the special vlen datatype.
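For instance, a minimal sketch of that workaround (assuming the ragged column is named totalCites2 in the sampleDF2 dataframe from the example below) might look like:

import h5py
import numpy as np

dt = h5py.special_dtype(vlen=np.dtype('int32'))
# one variable-length int32 array per dataframe row
ragged = [np.asarray(x, dtype=np.int32) for x in sampleDF2['totalCites2']]

with h5py.File('ragged.h5', 'w') as h5f:
    h5f.create_dataset('totalCites2', data=ragged, dtype=dt,
                       compression='gzip', compression_opts=9)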
I am wondering if there is a way to do this in Pandas.
The following is a minimal example using a small chunk of my data. It downloads the dataframe chunk, opens it, and saves it to HDF5:
import requests
import pickle
import numpy as np
import pandas as pd

# Download helper for Google Drive
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

# Download the Google Drive file and load the pickled dataframe
download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')
sampleDF2 = pd.read_pickle('sample.pkl')

# Save with the ragged column as lists, then again as numpy arrays
sampleDF2.to_hdf('pandasList.hdf', 'first', complevel=9)
sampleDF2['totalCites2'] = sampleDF2['totalCites2'].apply(lambda x: np.array(x))
sampleDF2.to_hdf('pandasNumpy.hdf', 'first', complevel=9)
For convenience, here is a Colab notebook with this code:
https://colab.research.google.com/drive/1DjiPsN3MbRWP6NnJwvaAhzy66FNbPVA8
Edit:
As hpualj mentioned, Pandas uses PyTables, not h5py, so it looks like the question should be how to use VLArray, which is how PyTables stores variable-length arrays.
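For reference, a minimal sketch of writing a ragged column directly with PyTables' VLArray (the data below is hypothetical, standing in for the totalCites2 column; this is not something to_hdf does for you) might look like:

import numpy as np
import tables

# hypothetical ragged rows standing in for the totalCites2 column
ragged = [np.array([1, 2, 3], dtype=np.int32),
          np.array([4, 5], dtype=np.int32),
          np.array([6], dtype=np.int32)]

with tables.open_file('ragged_vlarray.h5', mode='w') as h5f:
    # each row appended to a VLArray may have a different length
    vlarray = h5f.create_vlarray(h5f.root, 'totalCites2',
                                 atom=tables.Int32Atom(),
                                 filters=tables.Filters(complevel=9))
    for row in ragged:
        vlarray.append(row)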

Related

Is it possible to insert an image into a pandas data frame?

I wanted to save a test image dataset into a pandas data frame. The pandas data frame contains the input image, the input image class, and the predicted output class.
Do you need something like this?
import pandas as pd
from IPython.core.display import display, HTML

# empty dataframe
df = pd.DataFrame()

# your images
df['images1'] = ['https://a.cdn-hotels.com/gdcs/production180/d124/9dc35ac0-af3d-4cce-a7cf-02132213f43a.jpg?impolicy=fcrop&w=800&h=533&q=medium',
                 'https://upload.wikimedia.org/wikipedia/commons/2/2b/NYC_Downtown_Manhattan_Skyline_seen_from_Paulus_Hook_2019-12-20_IMG_7347_FRD_%28cropped%29.jpg']
df['images2'] = ['https://post.medicalnewstoday.com/wp-content/uploads/sites/3/2020/02/322868_1100-800x825.jpg',
                 'https://i.guim.co.uk/img/media/684c9d087dab923db1ce4057903f03293b07deac/205_132_1915_1150/master/1915.jpg?width=1200&height=1200&quality=85&auto=format&fit=crop&s=14a95b5026c1567b823629ba35c40aa0']

display(df)  # <-- At this point you have a dataframe with paths of images

# convert your links to html tags
def path_to_image_html(path):
    return '<img src="' + path + '" width="60" >'

pd.set_option('display.max_colwidth', None)

# If you have many columns, define which columns will be converted to html
image_cols = ['images1', 'images2']

# Create the dictionary to be passed as formatters
format_dict = {}
for image_col in image_cols:
    format_dict[image_col] = path_to_image_html

display(HTML(df.to_html(escape=False, formatters=format_dict)))
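If you also want to keep the rendered table outside the notebook, one option (a small sketch, with a hypothetical output filename) is to write the same HTML to a file and open it in a browser:

# write the rendered table to a file so it can be viewed outside the notebook
html = df.to_html(escape=False, formatters=format_dict)
with open('images_table.html', 'w') as f:
    f.write(html)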

How to convert a Pydantic model in FastAPI to a Pandas DataFrame?

I am trying to convert a Pydantic model to a Pandas DataFrame, but I am getting various errors.
Here is the code:
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import sklearn
import pandas as pd
import numpy as np

class Userdata(BaseModel):
    current_res_month_dec: Optional[int] = 0
    current_res_month_nov: Optional[int] = 0

async def return_recurrent_user_predictions_gb(user_data: Userdata):
    empty_dataframe = pd.DataFrame([Userdata(**{
        'current_res_month_dec': user_data.current_res_month_dec,
        'current_res_month_nov': user_data.current_res_month_nov})], ignore_index=True)
This is the DataFrame that is returned when trying to execute it through /docs in my local environment:
Response body:
{
  "0": {
    "0": [
      "current_res_month_dec",
      0
    ]
  },
  "1": {
    "0": [
      "current_res_month_nov",
      0
    ]
  }
}
but if I try to use this DataFrame for a prediction:
model_has_afternoon = pickle.load(open('./models/model_gbclf_prob_current_product_has_afternoon.pickle', 'rb'))
result_afternoon = model_has_afternoon.predict_proba(empty_dataframe)[:, 1]
I get this error:
ValueError: setting an array element with a sequence.
I have tried building my own DataFrame before, and the predictions should work with a DataFrame.
You first need to convert the Pydantic model into a dictionary using Pydantic's dict() method. Note that other methods, such as Python's dict() function and the .__dict__ attribute, have been found to be faster alternatives to Pydantic's dict() method (see this answer). However, since you are using a Pydantic model, it might be best to use Pydantic's dict() method and then pass the dictionary to pandas.DataFrame() surrounded by square brackets, e.g., pd.DataFrame([data.dict()]). As described in this answer, this approach can be used when you need the keys of the passed dict to be the columns and the values to be the rows. If you need to specify a different orientation, you can also use pandas.DataFrame.from_dict().
Working Example
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel
import pandas as pd

app = FastAPI()

class Userdata(BaseModel):
    col1: Optional[int] = 0
    col2: Optional[int] = 0
    col3: str = "foo"

@app.post('/submit')
def submit_data(data: Userdata):
    df = pd.DataFrame([data.dict()])
    return "Success"
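A quick way to try the endpoint (a sketch that assumes the app above is saved as main.py and served locally with uvicorn main:app) might be:

import requests

# hypothetical local call; assumes the FastAPI app is running on the default port
resp = requests.post('http://127.0.0.1:8000/submit',
                     json={'col1': 1, 'col2': 2, 'col3': 'bar'})
print(resp.json())  # "Success"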
More Options
As you mentioned that you would like to use the DataFrame for Machine Learning predictions, it should be noted that there are a few other options for passing the data to the predict() and predict_proba() functions that do not require creating a DataFrame. These options include:
model.predict([[data.col1, data.col2, data.col3]])
and
model.predict([list(data.dict().values())])
Please have a look at this answer for more details. In case you would also need to respond back to the client with a DataFrame in JSON format, please take a look here.

Combining CSV of different shapes into one CSV

I have CSVs with different numbers of rows and columns. I would like to create one large CSV where all the CSV data are stacked directly on top of each other, aligned by the first column. I tried the script below with limited success; b, which starts as an empty array, does not retain the data from previous loop iterations.
from os import walk
import sys
import numpy as np

filenames = []
dirpath = []
filtered = []
original = []
f = []
b = np.empty([2, 2])

for (dirpath, dirnames, filenames) in walk("C:\\Users\\dkim1\\Python Scripts\\output"):
    f.extend(dirnames)
print(f)

for names in f:
    print(names)
    df = np.genfromtxt('C:\\Users\\dkim1\\Python Scripts\\output\\' + names + '\\replies.csv',
                       dtype=None, delimiter=',', skip_header=1, names=True)
    b = np.column_stack(df)
    print(b)
Have you tried pd.concat()?
import os
import pandas as pd

# just used a single dir for example simplicity, rather than os.walk()
root_dir = "your directory path here"
file_names = os.listdir(root_dir)

cat_list = []
for names in file_names:
    df = pd.read_csv(os.path.join(root_dir, names), delimiter=',', header=None)
    cat_list.append(df)

concatted_df = pd.concat(cat_list)
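Since you mentioned aligning by the first column: pd.concat aligns on column labels by default, so one option (a sketch that assumes the CSVs have header rows) is to set the first column as the index before concatenating, then write a single combined CSV:

cat_list = []
for names in file_names:
    df = pd.read_csv(os.path.join(root_dir, names))
    # align on the first column by making it the index
    cat_list.append(df.set_index(df.columns[0]))

combined = pd.concat(cat_list)
combined.to_csv(os.path.join(root_dir, 'combined.csv'))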

Arranging data into lists from url

The following code is written in Python 2. How can I write it in Python 3? Thanks.
import urllib2
import sys

# read data from the UCI data repository
target_url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data")
data = urllib2.urlopen(target_url)

# arrange data into a list for labels and a list of lists for attributes
xList = []
labels = []
for line in data:
    # split on comma
    row = line.strip().split(",")
    xList.append(row)
You can use the requests library in Python 3:
import requests

data = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data").text

xList = []
for line in data.split('\n'):
    row = line.strip().split(",")
    xList.append(row)
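Alternatively, if you prefer to stay with the standard library, urllib.request is the Python 3 replacement for urllib2; a sketch of the same loop:

from urllib.request import urlopen

target_url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data")

xList = []
with urlopen(target_url) as data:
    for line in data:
        # responses are bytes in Python 3, so decode before splitting
        row = line.decode('utf-8').strip().split(",")
        xList.append(row)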

How to specify a column type (I need string) using the pandas.to_csv method in Python?

import pandas as pd

data = {'x': ['011', '012', '013'], 'y': ['022', '033', '041']}
Df = pd.DataFrame(data=data, dtype=str)
Df.to_csv("path/to/save.csv")
The result I've obtained looks like this:
To achieve such a result, it is easier to export directly to an xlsx file, even without setting the dtype of the DataFrame.
import pandas as pd

writer = pd.ExcelWriter('path/to/save.xlsx')
data = {'x': ['011', '012', '013'], 'y': ['022', '033', '041']}
Df = pd.DataFrame(data=data)
Df.to_excel(writer, "Sheet1")
writer.save()
I've also tried some other methods, like prepending an apostrophe or quoting all fields with ", but they had no effect.
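For completeness, another commonly suggested workaround (a sketch; it targets Excel specifically and assumes Excel is the viewer that strips the leading zeros) is to wrap each value in an Excel text formula before writing the CSV:

import pandas as pd

data = {'x': ['011', '012', '013'], 'y': ['022', '033', '041']}
Df = pd.DataFrame(data=data, dtype=str)

# wrap values as ="011" so Excel treats them as text when opening the CSV
Df_excel_safe = Df.applymap(lambda v: '="{}"'.format(v))
Df_excel_safe.to_csv('path/to/save.csv', index=False)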