Series.agg() works differently when passing function - pandas

I called a function inside agg() of a Series in the snippet below. On the first call it prints an int for the variable "a", and on the second call "a" comes through as a Series. I am not able to figure out the reason for this behaviour.
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])

def find_second_last(a):
    print(a)
    return a.iloc[-2]

ser.agg(find_second_last)

.iloc with a single position (no extra brackets) returns a plain int by default:
a.iloc[[-2]]  # returns a pd.Series
a.iloc[-2]    # returns an int
a.iloc[1:]    # returns a pd.Series
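As for why "a" is an int on the first call and a Series on the second: with a plain callable, Series.agg typically tries the function element-wise first and only falls back to calling it once on the whole Series when that attempt raises. A rough sketch of that behaviour (a simplification for illustration, not the actual pandas source):

def agg_like(ser, func):
    try:
        # First attempt: element-wise, so func sees plain ints (here it prints 1),
        # then fails because ints have no .iloc attribute.
        return ser.apply(func)
    except (AttributeError, TypeError, ValueError):
        # Fallback: call func once with the whole Series, which prints the
        # Series and returns the second-to-last value (4).
        return func(ser)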

Related

In a Pandas UDF, is it possible to return a Series containing nested lists?

I would like to create a Pandas UDF that returns a series containing a list of lists. This list represents the normalized pixel values of an image. Is it possible to return this datatype from a Pandas UDF? I tried adding ArrayType(FloatType()) to the Pandas UDF decorator, but am getting the error: could not convert <nested lists with floating point values> with type list: tried to convert to float 32. Would be great to hear your thoughts on this, thanks!
import base64
from io import BytesIO

import numpy as np
import pandas as pd
from PIL import Image
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

@pandas_udf(ArrayType(FloatType()))
def base64_to_arr(base64_images: pd.Series) -> pd.Series:
    def base64_to_arr(img):
        img_bytes = base64.b64decode(img)
        img_array = np.load(BytesIO(img_bytes))
        ## Resize
        pil_img = Image.fromarray(img_array)
        resized_img = pil_img.resize((32, 32))
        resized_image_arr = np.array(resized_img)
        ##
        normalized_img = resized_image_arr.astype("float32") / 255
        formatted_img = normalized_img.tolist()
        return formatted_img
    arr_images = base64_images.map(base64_to_arr)
    return arr_images
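One thing worth checking (an assumption based on the error message rather than a confirmed fix): ArrayType(FloatType()) describes a flat array of floats, whereas a 32x32 image converted with tolist() is a list of lists, so the declared return type would likely need to be nested to match:

# Hedged suggestion: nest the element type if each row really is a list of lists.
@pandas_udf(ArrayType(ArrayType(FloatType())))
def base64_to_arr(base64_images: pd.Series) -> pd.Series:
    ...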

How to call custom function in Pandas

I have defined a custom function to correct outliers in one of my DF columns. The function is working as expected, but I am not sure how to call this function on the DF. Could you please help me solve this?
Below is my custom function:
def corr_sft_outlier(in_bhk, in_sft):
    bhk_band = np.quantile(outlierdf2[outlierdf2.bhk_size==in_bhk]['avg_sft'], (.20, .90))
    lower_band = round(bhk_band[0])
    upper_band = round(bhk_band[1])
    if (in_sft >= lower_band) & (in_sft <= upper_band):
        return in_sft
    elif (in_sft < lower_band):
        return lower_band
    elif (in_sft > upper_band):
        return upper_band
    else:
        return None
And I am calling this function in the ways below, but neither is working.
outlierdf2[['bhk_size','avg_sft']].apply(corr_sft_outlier)
outlierdf2.apply(corr_sft_outlier(outlierdf2['bhk_size'],outlierdf2['avg_sft']))
Here you go:
outlierdf2['adj_avg_sft'] = outlierdf2.apply(lambda x: corr_sft_outlier(x['bhk_size'], x['avg_sft']), axis=1)
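If the row-wise apply turns out to be slow, here is a rough vectorised sketch of the same clamping, computing each bhk_size's quantile band once and clipping against it (this assumes the outlierdf2 columns from the question and is untested against the asker's data):

# compute the (.20, .90) band per bhk_size once, then clip each row against it
bands = outlierdf2.groupby('bhk_size')['avg_sft'].quantile([.20, .90]).unstack()
lower = outlierdf2['bhk_size'].map(bands[0.2]).round()
upper = outlierdf2['bhk_size'].map(bands[0.9]).round()
outlierdf2['adj_avg_sft'] = outlierdf2['avg_sft'].clip(lower=lower, upper=upper)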

How to call functions and create a new function

I have created 2 functions that return two dataframes. I want to create another function, merge the dataframes from function1 and function2, and manipulate the data there. How can I call the functions and merge the results together? The way I called them doesn't work for me.
def func1():
    return df1

def func2():
    return df2

def func3():
    func1()
    func2()
Your question is not entirely clear but what I think you mean is:
Use merge:
def func3():
    df = func1().merge(func2())
    # do something with df
    return df
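Note that merge with no arguments joins on whichever columns the two frames happen to share; if the join key should be explicit, it can be spelled out (the column name below is a placeholder, not something from the question):

def func3():
    # 'id' is a hypothetical shared key column; replace it with the real one
    df = func1().merge(func2(), on='id', how='inner')
    return df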

how to resize and subtract numpy arrays in c++

I have two numpy 3D arrays in Python with different heights and widths. I want to pass them to my C extension. How can I resize and subtract them in C++? Please see the comments in the code.
static PyObject *my_func(PyObject *self, PyObject *args)
{
    Py_Initialize();
    import_array();
    PyObject *arr1;
    PyObject *arr2;
    if (!PyArg_ParseTuple(args, "OO", &arr1, &arr2))
    {
        return NULL;
    }
    // How can I do this?
    // resize arr1 to [100, 100, 3]
    // resize arr2 to [100, 100, 3]
    // res = arr1 - arr2
    // return res
}
Start by making the desired shape. It's easier to do this as a tuple than a list:
PyObject* shape = Py_BuildValue("iii",100,100,3);
Check this against NULL to ensure no error has occurred, and handle it if one has.
You can call the numpy resize function on both arrays to resize them. Unless you are certain that the data isn't shared, you need to call numpy.resize rather than the .resize method of the arrays. This involves importing the module and getting the resize attribute:
PyObject* np = PyImport_ImportModule("numpy");
PyObject* resize = PyObject_GetAttrString(np,"resize");
PyObject* resize_result = PyObject_CallFunctionObjArgs(resize,arr1, shape,NULL);
I've omitted all the error checking, which you should do after each stage.
Make sure you decrease the reference counts on the various PyObjects once you don't need them any more.
Use PyNumber_Subtract to do the subtraction (do it on the result from resize).
Addition: A shortcut for calling resize that should avoid most of the intermediates:
PyObject* np = PyImport_ImportModule("numpy");
// error check against null
PyObject* resize_result = PyObject_CallMethod(np,"resize","O(iii)",arr1,100,100,3);
(The "(iii)" creates the shape tuple rather than needing to do it separately.)
If you are certain that arr1 and arr2 are the only owners of the data, then you can call the numpy .resize method either through the normal C API function calls or through the numpy-specific function PyArray_Resize.
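For reference, this is roughly what the extension is reproducing at the Python level (a sketch, assuming both inputs are numpy arrays):

import numpy as np

def resize_and_subtract(arr1, arr2):
    # np.resize returns new arrays of the requested shape, repeating or
    # truncating the data as needed; the C code above calls the same function.
    a = np.resize(arr1, (100, 100, 3))
    b = np.resize(arr2, (100, 100, 3))
    return a - b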

return a list from class object

I am using the multiprocessing module to generate 35 dataframes. I guess this will save me time. But the problem is that the class does not return anything. I expect the list of dataframes to be returned via self.dflist.
Here is how the dfnames list is created:
urls = []
fnames = []
dfnames = []
for x in xrange(100, 3600, 100):
    y = str(x)
    i = y.zfill(4)
    filename = 'DCHB_Town_Release_' + i + '.xlsx'
    url = "http://www.censusindia.gov.in/2011census/dchb/" + filename
    urls.append(url)
    fnames.append(filename)
    dfnames.append((filename, 'DCHB_Town_Release_' + i))
This is the class that uses the dfnames generated by the above code.
import pandas as pd
import multiprocessing

class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        self.dflist = list()
        self.jobs = list()
        self.dfnames = dfnames

    def dframe_create(self, filename, dfname):
        print 'abc', filename, dfname
        dfname = pd.read_excel(filename)
        self.dflist.append(dfname)
        print self.dflist
        return self.dflist

    def mp(self):
        for f, d in self.dfnames:
            p = multiprocessing.Process(target=self.dframe_create, args=(f, d))
            self.jobs.append(p)
            p.start()
        #return self.dflist
        for j in self.jobs:
            j.join()
            print '%s.exitcode = %s' % (j.name, j.exitcode)
This class, when called like this...
dflist = []
jobs = []
x = mydf1(dflist, jobs, dfnames)
y = x.mp()
...prints self.dflist correctly, but does not return anything.
I can collect all dataframes sequentially, but in order to save time I need to use multiple processes simultaneously to generate and add dataframes to a list.
In your case I would prefer to write as little code as possible and use Pool:
import pandas as pd
import logging
import multiprocessing

def dframe_create(filename):
    try:
        return pd.read_excel(filename)
    except Exception as e:
        logging.error("Something went wrong: %s", e, exc_info=1)
        return None

p = multiprocessing.Pool()
# map over the plain filenames (fnames) from the question; dfnames holds
# (filename, name) tuples, which read_excel cannot open directly
excel_files = p.map(dframe_create, fnames)

for f in excel_files:
    if f is not None:
        print 'Ready to work'
    else:
        print ':('
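If the end goal is the 35 dataframes keyed by name rather than a bare list, one way to pair the Pool results back up (a sketch, assuming the fnames and dfnames lists built in the question):

# dfnames holds (filename, name) pairs in the same order as fnames,
# so the results can simply be zipped back up with their names
frames = {name: df for (fname, name), df in zip(dfnames, excel_files) if df is not None}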
Prints the self.dflist correctly. But does not return anything.
That's because you don't have a return statement in the mp method, e.g.
def mp(self):
    ...
    return self.dflist
It's not entirely clear what your issue is; however, you have to take some care here in that you can't just pass objects/lists across processes. That's why there are special objects (which lock while they make modifications to a list), so that you don't get tripped up when two processes try to make a change at the same time (and you only end up with one update).
That is, you have to use a list proxy from a multiprocessing Manager.
class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        manager = multiprocessing.Manager()
        self.dflist = manager.list()  # perhaps should be manager.list(dflist or ())
        self.jobs = list()
        self.dfnames = dfnames
However, you have a bigger problem: the whole point of multiprocessing is that the processes may run/finish out of order, so keeping two parallel lists like this is doomed to fail. You should use a Manager dict instead, so that each DataFrame is saved unambiguously against its name.
class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        manager = multiprocessing.Manager()
        self.dfdict = manager.dict()
        ...

    def dframe_create(self, filename, dfname):
        print 'abc', filename, dfname
        df = pd.read_excel(filename)
        self.dfdict[dfname] = df
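A short usage sketch for the Manager-backed version (this assumes the mp() method shown earlier still starts and joins the jobs, and the dfnames list from the question):

x = mydf1([], [], dfnames)
x.mp()
# copy the results out of the Manager proxy once every job has joined
dataframes = dict(x.dfdict)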